Article

CRS-Y: A Study and Application of a Target Detection Method for Underwater Blasting Construction Sites

1 Hubei Key Laboratory of Blasting Engineering, Jianghan University, Wuhan 430056, China
2 College of Science, Wuhan University of Science and Technology, Wuhan 430065, China
3 Hubei Province Intelligent Blasting Engineering Technology Research Center, Wuhan 430065, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(2), 615; https://doi.org/10.3390/app16020615
Submission received: 5 December 2025 / Revised: 22 December 2025 / Accepted: 27 December 2025 / Published: 7 January 2026

Abstract

To strengthen the safety management and control of explosives at underwater blasting construction sites, this study proposes an improved YOLOv11-based network named CRS-Y, designed to enhance the detection accuracy of explosives in complex underwater environments and improve the recognition of multi-scale targets. To address the limitations of traditional object detection methods in handling complex backgrounds and low-resolution targets, a lightweight re-parameterized vision transformer is integrated into the C3K module, forming a novel CSP structure (C3K-RepViT) that enhances feature extraction under small receptive fields. In combination with the Efficient Multi-Scale Attention (EMA) mechanism, the model’s spatial feature representation is further strengthened, enabling a more effective understanding of objects in complex scenes. Furthermore, to reduce the computational cost of the P2 feature layer, a new convolutional structure named SPD-DSConv (Space-to-Depth Depthwise Separable Convolution) is proposed, which integrates downsampling and channel expansion within depthwise separable convolution. This design achieves a balance between parameter reduction and multidimensional feature learning. Finally, the Inner-IoU loss function is introduced to dynamically adjust auxiliary bounding box scales, accelerating regression convergence for both high-IoU and low-IoU samples, thereby optimizing bounding box shapes and localization accuracy while improving overall detection performance and robustness. Experimental results demonstrate that the proposed CRS-Y model achieved superior performance on the COCO128, URPC2020, and self-constructed underwater blasting datasets, effectively meeting the real-time detection requirements of underwater blasting construction scenarios while exhibiting strong generalization ability and practical value.

1. Introduction

With the rapid advancement of national transportation infrastructure, blasting engineering has been extensively utilized in fields such as underwater reef blasting, tunnel excavation, and mining. However, safety management at blasting sites, particularly in operational environments involving large quantities of explosives and pyrotechnics, remains a critical focus of the engineering community [1]. Taking the “New Western Land–Sea Corridor,” a major national infrastructure project, as an example, channel excavation and port expansion frequently involve drilling and blasting operations. Due to complex and variable aquatic environments, diverse geological conditions, and expansive working faces, traditional safety inspection models, which rely heavily on manual supervision, suffer from limitations such as poor real-time performance, high omission rates, and lagged risk identification. In such complex engineering sites, misoperations or positional conflicts between explosives and personnel can easily lead to catastrophic accidents. Therefore, the introduction of deep learning technology to achieve real-time intelligent detection of key targets is of profound practical significance for enhancing the intrinsic safety level of large-scale water conservancy and transportation projects.
Despite its potential, the underwater blasting environment presents significant challenges, including intense light interference, cluttered backgrounds, and drastic variations in target scales. Traditional computer vision techniques struggle to meet practical demands due to insufficient feature extraction capabilities and low inference efficiency. Scholars worldwide have conducted extensive research on object detection in complex scenarios. For instance, Ye et al. [2] improved YOLOv5s by introducing Varifocal Loss to mitigate class imbalance; however, this approach does not adequately balance the contribution of easy samples. Zhao et al. [3] utilized the HOSI module and SPFA attention mechanism to enhance the ability of YOLOv7 to capture the structure of underwater targets, but the stacking of modules significantly increased the computational overhead. Lei et al. [4] integrated the Swin Transformer into the YOLOv5 backbone, which improved multi-scale feature aggregation but left the model insufficiently lightweight, making real-time deployment at the edge difficult. In summary, existing methods still face a trade-off among accuracy, speed, and deployment difficulty in extreme environments such as underwater blasting, leaving room for improvement, particularly in balancing small-scale explosive detection with inference efficiency.
To address these challenges, this paper proposes an improved object detection model, CRS-Y, based on the YOLOv11 framework. The model aims to achieve high-precision and high-efficiency target recognition within the complex context of underwater blasting operations. The primary contributions and innovations of this study are as follows:
  • Backbone Reconstruction for Enhanced Feature Extraction: By integrating the C3K-RVB structure with the Efficient Multi-Scale Attention (EMA) mechanism, the C3K module is reconstructed to significantly strengthen the model’s context awareness and feature capture capabilities for tiny explosive targets against complex backgrounds.
  • Optimization of Detection Head and Downsampling Strategy: The Space-to-Depth Convolution (SPDConv) is integrated into depthwise separable convolutions to replace conventional convolutions in the Neck network. This preserves fine-grained feature information and improves multi-scale detection performance while effectively reducing the model’s parameter count and computational complexity.
  • Refinement of Bounding Box Regression Accuracy: The Inner-IoU loss function is introduced to accelerate the regression process by controlling the scale of auxiliary bounding boxes. This accurately optimizes the localization of explosives and other targets, addressing the issue of significant localization deviations inherent in baseline models within complex, occluded environments.

2. Feature Analysis of Underwater Blasting Construction Sites

The engineering background of this study is the channel construction project of the Western Land–Sea New Corridor (Pinglu) Canal. The project mainly involves excavation and widening of the channel; the construction content of the 7th bid section includes onshore earth-rock blasting and underwater reef blasting, with a total earth-rock engineering quantity of 21.4116 million cubic meters. The operation points are widely distributed along a long alignment, which makes construction organization and safety management difficult. To strengthen on-site safety control, the operation surfaces of the construction site are classified into onshore operation areas and drilling-blasting ship deck operation areas, as shown in Figure 1. Onshore operations require strengthened management of high-risk items (such as initiating explosive material transport vehicles, detonators, and explosives), while drilling-blasting ship deck operations require real-time monitoring of high-risk items (detonators, explosives, etc.) to ensure construction safety.
The two operation methods present distinct category distribution characteristics in target detection: targets to be detected in onshore earth-rock blasting operations are relatively scattered, with significant differences in sample scales; in contrast, targets on the operation surface of drilling-blasting ships are more concentrated, and the variation in sample scales is smaller. However, regardless of the operation method, the target detection task at the blasting construction site of large-scale waterway projects still faces many common challenges, such as strong light interference, target occlusion and overlap, and the difficulty of detecting small targets. To meet the detection requirements in these complex scenarios, this paper focuses on improving the target detection model framework to enhance its practical effectiveness at underwater blasting construction sites.

3. Model Selection and Model Improvement

3.1. Overview of Object Detection Algorithms

Object detection is a fundamental task in the field of computer vision, aiming to simultaneously localize target objects and identify their corresponding categories in images or videos [5]. In recent years, with the rapid advancement of deep learning technologies, object detection algorithms based on convolutional neural networks (CNNs) have achieved remarkable improvements in both detection accuracy and real-time performance. These advances have enabled their widespread adoption in applications such as autonomous driving, video surveillance, industrial inspection, and intelligent security systems [6]. From an architectural perspective, existing mainstream object detection methods can generally be categorized into two groups: two-stage detectors and one-stage detectors.
Two-stage detectors are typically represented by the R-CNN family, whose core idea is to decompose the detection process into two separate stages: region proposal generation and object classification with bounding box regression [7]. The original R-CNN employed selective search to generate candidate regions and performed feature extraction and classification for each region independently. Although this approach achieved relatively high detection accuracy, it suffered from excessive computational overhead [8]. To improve efficiency, Fast R-CNN and Faster R-CNN introduced shared feature extraction and region proposal networks, significantly reducing redundant computations and achieving a better trade-off between detection accuracy and computational efficiency [9].
In contrast to two-stage detectors, one-stage detectors formulate object detection as an end-to-end regression problem, directly predicting object categories and bounding box coordinates in a single forward pass, thereby substantially improving detection speed [10]. Representative one-stage methods include the YOLO series, SSD, and RetinaNet. The YOLO family divides the input image into grids and performs direct predictions on each grid, achieving extremely high inference speed and demonstrating outstanding performance in real-time detection tasks. SSD [11] employs multi-scale feature maps for detection, effectively enhancing its capability to detect objects of varying scales. RetinaNet introduces Focal Loss to alleviate the severe class imbalance between positive and negative samples commonly encountered in one-stage detectors, significantly improving the detection accuracy while preserving speed advantages [12].
Regarding the choice of detection framework, this study ultimately adopted the YOLOv11 one-stage detector, primarily due to its excellent balance between real-time performance and hardware adaptability. Although two-stage detectors such as the R-CNN family generally achieve higher accuracy, the computational latency introduced by their complex region proposal mechanisms makes them unsuitable for the stringent real-time safety warning requirements of underwater blasting scenarios. Furthermore, the anticipated deployment environments in this study include NVIDIA Jetson Orin series single-board computers (SBCs) (Santa Clara, CA, USA) installed on drilling and blasting vessel monitoring masts and mobile onshore monitoring platforms, as well as portable imaging terminals carried by on-site operators. By optimizing the backbone network, YOLOv11 enhances feature extraction capability while maintaining a highly lightweight design, ensuring smooth real-time video stream analysis even on resource-constrained hardware. Consequently, it effectively addresses the inherent trade-offs among detection accuracy, inference speed, and deployment complexity in challenging engineering environments.

3.2. YOLOv11 Model

YOLOv11, released in late 2024, is one of the latest-generation target detection models in the YOLO series, aiming to further improve detection accuracy and inference efficiency; its network structure is shown in Figure 2. While inheriting the advantages of the YOLO series, such as lightweight design and high accuracy, YOLOv11 introduces multiple structural innovations and optimizations, making it particularly suitable for target detection tasks on resource-constrained devices and in complex environments. Rong et al. [13] constructed an intelligent recognition model for concrete surface humidity based on the YOLOv11 algorithm, integrating a novel feature extraction module, C3K2-IMSC, to enhance the ability to identify the curing degree of concrete beams. Peng et al. [14] proposed the MELE-YOLOv11n model, introducing depthwise separable convolution and an efficient multi-scale attention mechanism, which effectively strengthened the model’s ability to perceive defective targets in the complex environment of mines.
The network design of YOLOv11 can be divided into four modules: the input layer, backbone network, Neck network, and Head network [15]. YOLOv11 introduces innovations in both the backbone and detection head structures, further improving detection performance and deployment efficiency. The backbone network incorporates the lightweight C3k2 module and the C2PSA attention mechanism, which reduce computational complexity while enhancing feature extraction and the ability to focus on discriminative regions. The Neck adopts an improved FPN [16] + PANet [17] structure and integrates a spatial attention mechanism, improving adaptability to small targets and complex backgrounds. The detection head employs a decoupled structure and introduces an enhanced DFL [18] loss together with a category-adaptive BCE strategy [19], optimizing both regression accuracy and classification performance. In terms of training strategies, YOLOv11 utilizes Mosaic data augmentation, a dynamic label assigner, and an EMA smoothing mechanism to enhance the generalization ability and stability of the model.

3.3. CRS-Y Model

The overall architecture of the proposed CRS-Y model for object detection in underwater blasting construction scenarios is illustrated in Figure 3. To address the specific challenges encountered in underwater blasting operations—namely complex illumination conditions characterized by the coexistence of strong reflections and shadowed regions, severe target overlap and occlusion, and a high proportion of small-scale explosive objects—the baseline model is systematically enhanced. The major improvements are summarized as follows:
  • Backbone Network Reconstruction and Feature Decoupling: Underwater blasting environments exhibit highly unstructured background characteristics, where uneven surface reflections often lead to significant confusion between targets and background textures. To mitigate this issue, the original C3K2 module is replaced with the proposed C3K-RVB module. By explicitly decoupling spatial feature modeling (Token Mixing) from channel-wise feature modeling (Channel Mixing), the network is able to more robustly disentangle complex background interference (spatial information) from the semantic attributes of targets (channel information), thereby significantly enhancing feature representation accuracy under adverse lighting conditions. In addition, the SE module is replaced with an EMA attention mechanism, which leverages cross-dimensional interactions to strengthen the recovery of degraded visual features (e.g., partially occluded targets) without introducing additional computational overhead.
  • Spatial Information Preservation in the Neck Network: In large-scale operational scenes, explosive objects located at long distances typically appear at extremely small scales, making them highly susceptible to feature loss during conventional convolutional downsampling. To alleviate this problem, SPDConv is embedded into depthwise separable convolutions within the Neck network. By transferring spatial pixel information into the channel dimension, this design avoids the information degradation caused by stride-based convolutions or pooling operations, thereby substantially improving feature retention and representational capability for small-scale targets.
  • Multi-Scale Loss Function Optimization: To address the large variation in target scales and the blurred object boundaries resulting from occlusion in construction scenes, an Inner-IoU loss function is incorporated into the detection head, together with a scale ratio factor. By generating auxiliary bounding boxes at different scales to participate in the loss computation, the model is able to more effectively capture fine-grained boundary information of multi-scale targets, leading to improved localization accuracy in complex and occlusion-heavy environments.

3.4. Design of the Backbone Network

3.4.1. Design of the C3K-RVB Structure

The backbone network of YOLOv11 is the core component responsible for feature extraction. The YOLOv11 model adopted in this study is built on a CSPDarknet-style architecture [20], and its key module, C3K2, is mainly used for feature fusion at different stages. However, when dealing with large-scale image data in complex environments, such as initiating explosive materials at blasting sites, this module suffers from high computational overhead, which limits the real-time performance of the model.
To improve detection efficiency and feature perception capability, this paper introduces a lightweight re-parameterized vision Transformer structure, RepViT [21], to reconstruct the original C3K2 module into an optimized C3K-RVB structure. This structure combines the global modeling capability of a lightweight Transformer-style architecture with the feature perception advantages of a lightweight attention mechanism, enabling it to more effectively capture key regional features in small-target detection and low-resolution image analysis while significantly improving inference efficiency. The traditional C3K2 module relies mainly on simple concatenation and addition operations for feature fusion, lacks a dynamic weight adjustment mechanism, and tends to weaken some key features during fusion [22]. In contrast, the C3K-RVB structure introduces a dynamic perception mechanism that effectively enhances feature representation, making it better suited to target detection tasks in underwater blasting construction scenarios, which demand both high detection accuracy and high inference speed. Its structure is shown in Figure 4.
The RepViT module draws on the MetaFormer design concept from Transformers and performs a modular reconstruction of the traditional lightweight CNN structure. Its core design idea is to explicitly separate spatial feature modeling (Token Mixing) from channel feature modeling (Channel Mixing) to improve representation capability and model flexibility. The specific structure of the RepViT module is shown in Figure 5.
The RepViTBlock is mainly composed of two parts:
The Token Mixer incorporates RepVGGDW + SE operations. This part first adopts a RepVGG-style re-parameterized depthwise convolution (RepVGGDW) for spatial feature extraction, capturing local texture and structural information. It is followed by a Squeeze-and-Excitation (SE) module [23], which enhances the attention distribution over channels and thereby improves the ability to focus on target regions.
The Channel Mixer includes an MLP + Residual structure. The channel mixing part employs an MLP structure composed of two pointwise convolutions, with a GELU activation function inserted in between. The first layer is used for dimension expansion, and the second layer is for projecting back to the original dimension; moreover, its BatchNorm layer is initialized to 0 to ensure that the model exhibits an approximate identity mapping in the initial stage, thus stabilizing training. The entire MLP is encapsulated in a Residual module, which realizes element-wise addition of input and output, enhances gradient flow, and reduces information loss.
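For illustration, the following is a minimal PyTorch sketch of a RepViT-style block reflecting this Token Mixing/Channel Mixing split. It assumes a single folded depthwise 3 × 3 convolution in place of the multi-branch re-parameterized form and omits the zero-initialized BatchNorm detail, so it should be read as a structural sketch rather than the exact implementation of [21] or of the CRS-Y code.

```python
import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Channel attention used inside the token mixer."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)

class RepViTBlockSketch(nn.Module):
    """Token mixer (depthwise conv + SE) followed by channel mixer (1x1 MLP)."""
    def __init__(self, channels, mlp_ratio=2):
        super().__init__()
        # Token mixing: spatial modelling with a depthwise 3x3 convolution
        # (the multi-branch re-parameterized form is folded into one conv here).
        self.token_mixer = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.BatchNorm2d(channels),
            SqueezeExcite(channels),
        )
        # Channel mixing: two pointwise convolutions with GELU in between.
        hidden = channels * mlp_ratio
        self.channel_mixer = nn.Sequential(
            nn.Conv2d(channels, hidden, 1),
            nn.GELU(),
            nn.Conv2d(hidden, channels, 1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        x = x + self.token_mixer(x)    # residual spatial branch
        x = x + self.channel_mixer(x)  # residual channel branch
        return x

if __name__ == "__main__":
    y = RepViTBlockSketch(64)(torch.randn(1, 64, 40, 40))
    print(y.shape)  # torch.Size([1, 64, 40, 40])
```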

3.4.2. Design of the C3K-RVB-EMA Structure

In various computer vision tasks, the effectiveness of channel attention and spatial attention mechanisms in improving the quality of feature representation has been widely verified. However, channel attention mechanisms usually rely on channel dimensionality reduction to model cross-channel relationships, and this operation may introduce information loss when extracting deep visual representations, resulting in certain side effects. Although the SE attention mechanism adopted in the C3K-RVB structure in the previous section can enhance the response of “important channels”, its modeling capability is mainly concentrated on the channel dimension, making it difficult to effectively capture the “importance of spatial positions” in images. To further strengthen the spatial modeling capability of the model, this paper introduces a lightweight spatial attention mechanism, Efficient Multi-scale Attention (EMA) [24], into the improved C3K-RVB structure to effectively model spatial context information in images. The EMA module constructs attention maps through efficient multi-scale pooling operations, thereby improving spatial perception capability while maintaining computational efficiency. Its structure is shown in Figure 6.
The key idea of the EMA mechanism is to extract global information at multiple spatial scales and construct attention maps through simple and efficient pooling operations. The core process applies Global Average Pooling (GAP), Horizontal Average Pooling (HAP), and Vertical Average Pooling (VAP) to compress the input features at different scales. The three pooling methods are defined as follows. The global pooling operation produces a channel attention vector:
$$F_g = \mathrm{GAP}(X) = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} X_{:,i,j}$$
The GAP operation sums all pixels within each channel and takes the average. For a single channel, H × W − 1 addition operations are required; for all C channels, the total computational load is approximately C × H × W, so the time complexity of GAP is O(C × H × W). Compared with traditional convolutional layers or fully connected (FC) layers, GAP involves no multiplication operations, and its computational overhead is therefore extremely low. This is precisely why EMA can remain simple and efficient.
Vertical pooling preserves the averaged feature of each column:
$$F_v(j) = \frac{1}{H}\sum_{i=1}^{H} X_{:,i,j}$$
Horizontal pooling preserves the averaged feature of each row:
$$F_h(i) = \frac{1}{W}\sum_{j=1}^{W} X_{:,i,j}$$
Here, $X$ denotes the input feature map, with $X \in \mathbb{R}^{C \times H \times W}$, $F_g \in \mathbb{R}^{C \times 1 \times 1}$, $F_v \in \mathbb{R}^{C \times 1 \times W}$, and $F_h \in \mathbb{R}^{C \times H \times 1}$.
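The three pooling operations can be expressed compactly in PyTorch. The sketch below covers only the multi-scale poolings defined above; the full EMA module [24] additionally performs channel grouping, parallel 1 × 1 and 3 × 3 branches, and cross-spatial learning, which are omitted here.

```python
import torch

def ema_poolings(x: torch.Tensor):
    """Multi-scale average poolings used to build the EMA attention maps.

    x: feature map of shape (N, C, H, W).
    Returns the global, vertical (per-column) and horizontal (per-row) averages.
    """
    f_g = x.mean(dim=(2, 3), keepdim=True)  # (N, C, 1, 1): global average pooling
    f_v = x.mean(dim=2, keepdim=True)       # (N, C, 1, W): average over rows, one value per column
    f_h = x.mean(dim=3, keepdim=True)       # (N, C, H, 1): average over columns, one value per row
    return f_g, f_v, f_h

if __name__ == "__main__":
    g, v, h = ema_poolings(torch.randn(2, 16, 32, 48))
    print(g.shape, v.shape, h.shape)  # (2,16,1,1) (2,16,1,48) (2,16,32,1)
```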

3.5. Design of the Neck Network

3.5.1. Depthwise Separable Convolution

In the Neck network of the YOLO model, the basic convolution block is the core structure for feature extraction; it is usually composed of a standard convolution layer, batch normalization, and a non-linear activation function [25]. Standard convolution performs pixel-wise weighted summation over the input feature map through convolution kernels to extract semantic features at different scales and spatial positions. Despite its good representational performance, it suffers from high computational overhead and a limited receptive field, especially when deployed on resource-constrained devices or in scenarios requiring real-time performance. To this end, this paper uses depthwise separable convolution to replace some basic convolution blocks in the Neck network. This structure decomposes standard convolution into two independent operations: depthwise convolution performs a convolution on each input channel separately to preserve spatial features, and pointwise convolution (a 1 × 1 convolution) integrates channel information to achieve cross-channel feature fusion. This design effectively reduces the computational complexity and parameter count of the model while largely retaining its feature extraction capability. The structure of depthwise separable convolution is shown in Figure 7.
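As a rough illustration of the parameter savings, the sketch below assembles a depthwise separable 3 × 3 convolution in PyTorch and compares its weight count with that of a standard convolution; the channel sizes are arbitrary examples rather than values taken from the CRS-Y configuration.

```python
import torch
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise separable convolution: depthwise 3x3 followed by pointwise 1x1."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, 3, stride=stride, padding=1,
                                   groups=c_in, bias=False)    # per-channel spatial filtering
        self.pointwise = nn.Conv2d(c_in, c_out, 1, bias=False)  # cross-channel fusion
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

def n_params(m):
    return sum(p.numel() for p in m.parameters())

if __name__ == "__main__":
    std = nn.Conv2d(128, 256, 3, padding=1, bias=False)
    dsc = DSConv(128, 256)
    print(n_params(std), n_params(dsc))  # the standard conv carries roughly 8x more weights here
```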

3.5.2. SPD-DSConv

The PAFPN (Path Aggregation Feature Pyramid Network) architecture of YOLOv11 achieves cross-scale interaction through bidirectional feature fusion, but it has obvious shortcomings in small-target detection. To address the computational burden introduced by the P2 detection layer, this study adopts SPDConv to process the P2 feature layer. SPDConv [26] rearranges spatial information into the channel dimension (space-to-depth) to realize multi-dimensional joint feature learning. Integrating SPDConv into depthwise separable convolution yields two major advantages over traditional convolution: fewer parameters and more complete retention of small-target information. Its structure is shown in Figure 8:
The principle of SPDConv is as follows:
x′ = Concat(x[…, ::2, ::2], x[…, 1::2, ::2], x[…, ::2, 1::2], x[…, 1::2, 1::2]), dim = 1
output = Conv(x′, stride = 1)
Here, x[…, ::2, ::2] denotes slicing of the tensor x: the ellipsis (…) selects all non-specified dimensions, and ::2 selects elements with a step size of 2 starting from index 0; thus rows and columns at even index positions are selected.
x[…, 1::2, ::2] selects rows at odd index positions and columns at even index positions.
x[…, ::2, 1::2] selects rows at even index positions and columns at odd index positions.
x[…, 1::2, 1::2] selects rows and columns at odd index positions.
Through this space-to-depth rearrangement, the spatial resolution is halved while the channel count is expanded fourfold, and the subsequent stride-1 convolution fuses the expanded channels. Unlike conventional stride = 2 convolution or pooling operations, SPDConv preserves the information from all four sampling positions by reorganizing it into the channel dimension before fusion, realizing effective downsampling without discarding features and significantly improving the model’s perception of local textures and small objects. By replacing the 3 × 3 convolution in the standard depthwise separable convolution with SPDConv, the SPD-DSConv module is obtained.
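The following is a minimal PyTorch sketch of this SPD-DSConv idea, combining the space-to-depth slicing above with a depthwise separable, stride-1 convolution. The layer ordering, normalization, and activation are plausible illustrative choices and are not taken from the released CRS-Y code.

```python
import torch
import torch.nn as nn

class SPDDSConvSketch(nn.Module):
    """Space-to-depth slicing followed by a depthwise separable, stride-1 convolution."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_mid = 4 * c_in  # space-to-depth quadruples the channel count
        self.depthwise = nn.Conv2d(c_mid, c_mid, 3, padding=1, groups=c_mid, bias=False)
        self.pointwise = nn.Conv2d(c_mid, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU(inplace=True)

    @staticmethod
    def space_to_depth(x):
        # Reorganize each 2x2 spatial neighbourhood into the channel dimension:
        # (N, C, H, W) -> (N, 4C, H/2, W/2); no information is discarded.
        return torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                          x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)

    def forward(self, x):
        x = self.space_to_depth(x)
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

if __name__ == "__main__":
    y = SPDDSConvSketch(64, 128)(torch.randn(1, 64, 80, 80))
    print(y.shape)  # torch.Size([1, 128, 40, 40])
```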

3.6. Improvements in Detection Head Loss Function

The primary role of the loss function is to quantify the discrepancy between predicted values and ground truth, providing a measure of model performance. The original YOLOv11 loss function is primarily based on IoU, which exhibits limitations when handling complex backgrounds and multi-scale object detection. Specifically, IoU may neglect differences in bounding box shape and position, under-penalize out-of-box errors, and show insensitivity to small objects. In underwater blasting scenarios, the positions and aspect ratios of detonators vary widely, making standard IoU inadequate. Therefore, this study replaces it with Inner-IoU Loss [27] to address these issues effectively.
In this study, we adopt the Inner-IoU loss, introducing a scale factor, ratio, to control the size of the auxiliary bounding boxes used for loss computation. This approach better accommodates the requirements of target detection at underwater blasting construction sites. As illustrated in Figure 9, the ground-truth (GT) box and the anchor box are denoted as $B^{gt}$ and $B$, respectively. The center point shared by the GT box and its internal auxiliary box is $(x_c^{gt}, y_c^{gt})$, while that of the anchor box and its internal auxiliary box is $(x_c, y_c)$. The width and height of the GT box are $w^{gt}$ and $h^{gt}$, and those of the anchor box are $w$ and $h$. The variable ratio is the scale factor, which is typically set within the range [0.5, 1.5].
The definition of Inner-IoU is as follows:
$$b_l^{gt} = x_c^{gt} - \frac{w^{gt} \times ratio}{2}, \quad b_r^{gt} = x_c^{gt} + \frac{w^{gt} \times ratio}{2}$$
$$b_t^{gt} = y_c^{gt} - \frac{h^{gt} \times ratio}{2}, \quad b_b^{gt} = y_c^{gt} + \frac{h^{gt} \times ratio}{2}$$
$$b_l = x_c - \frac{w \times ratio}{2}, \quad b_r = x_c + \frac{w \times ratio}{2}$$
$$b_t = y_c - \frac{h \times ratio}{2}, \quad b_b = y_c + \frac{h \times ratio}{2}$$
$$inter = \big(\min(b_r^{gt}, b_r) - \max(b_l^{gt}, b_l)\big) \times \big(\min(b_b^{gt}, b_b) - \max(b_t^{gt}, b_t)\big)$$
$$union = w^{gt} \times h^{gt} \times ratio^2 + w \times h \times ratio^2 - inter$$
$$IoU_{inner} = \frac{inter}{union}$$
Inner-IoU inherits some properties of IoU while introducing its own characteristics. Like IoU, the range of Inner-IoU is [0, 1]. Because auxiliary boxes differ from actual boxes only in scale, the computation method is similar. The Inner-IoU-Deviation curve resembles the IoU-Deviation curve. Compared to IoU, when ratio < 1, the auxiliary box is smaller than the ground truth, resulting in a smaller effective regression range but a larger gradient magnitude, accelerating convergence for high-IoU samples. Conversely, when ratio > 1, larger auxiliary boxes expand the effective regression range, benefiting low-IoU regressions. This approach provides technical support for detecting explosives in blasting scenarios, stabilizes training, and improves overall detection accuracy.
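A minimal PyTorch sketch of the Inner-IoU computation defined by the equations above is given below, assuming boxes in (center-x, center-y, width, height) format. The intersection is clamped at zero for non-overlapping auxiliary boxes, and the ratio value used in the example is purely illustrative.

```python
import torch

def inner_iou(pred, target, ratio=0.75, eps=1e-7):
    """Inner-IoU between predicted and ground-truth boxes given as (cx, cy, w, h).

    Auxiliary boxes share the centres of the original boxes but are scaled by `ratio`;
    the IoU is then computed between the two auxiliary boxes.
    """
    px, py, pw, ph = pred.unbind(-1)
    gx, gy, gw, gh = target.unbind(-1)

    # Auxiliary (scaled) box edges for the prediction and the ground truth.
    pl, pr = px - pw * ratio / 2, px + pw * ratio / 2
    pt, pb = py - ph * ratio / 2, py + ph * ratio / 2
    gl, gr = gx - gw * ratio / 2, gx + gw * ratio / 2
    gt, gb = gy - gh * ratio / 2, gy + gh * ratio / 2

    inter_w = (torch.min(pr, gr) - torch.max(pl, gl)).clamp(min=0)
    inter_h = (torch.min(pb, gb) - torch.max(pt, gt)).clamp(min=0)
    inter = inter_w * inter_h
    union = pw * ph * ratio**2 + gw * gh * ratio**2 - inter + eps
    return inter / union

if __name__ == "__main__":
    p = torch.tensor([[50.0, 50.0, 20.0, 40.0]])
    g = torch.tensor([[52.0, 48.0, 22.0, 38.0]])
    print(1.0 - inner_iou(p, g))  # loss value; a smaller ratio sharpens gradients for high-IoU samples
```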

4. Detection Testing and Result Analysis

4.1. Test Dataset

A dataset containing 11,200 images from blasting sites was collected, covering three explosive-related categories and one essential detection category: detonators, explosives, explosive vehicles, and personnel. Among them, 1200 detonator samples were included, representing the various foot-line configurations encountered in real blasting scenarios. In addition, the COCO128 dataset was employed to preliminarily validate the effectiveness of the proposed model. Since underwater blasting sites involve submerged operations, the model also needs to demonstrate a certain underwater object detection capability; therefore, the CRS-Y model was further evaluated on the URPC2020 public dataset to assess its generalization performance. During training, multiple data augmentation techniques were applied to improve generalization and robustness, including Mosaic and Mixup augmentation, hue-saturation-value (HSV) adjustments, random scaling and translation, and horizontal flipping. Finally, the dataset was split into training and test sets at a ratio of 8:2.

4.2. Network Configuration

The experiments were conducted on a Windows 11 operating system using Python 3.8.0 and the PyTorch 1.12.1 deep learning framework. Training was performed on an NVIDIA GeForce RTX 5070 Ti GPU. For model optimization, the SGD optimizer was used, training for 200 epochs with a batch size of 16. The initial learning rate was set to 0.01, momentum to 0.9, and weight decay to 0.0005.
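For reference, the training configuration above maps onto an Ultralytics-style training call roughly as sketched below. The model and dataset YAML file names are hypothetical placeholders, and only the hyperparameters stated in this subsection are assumed.

```python
# Hedged sketch of the training setup described above (Ultralytics-style interface).
from ultralytics import YOLO

model = YOLO("crs-y.yaml")      # hypothetical model definition file for CRS-Y
model.train(
    data="blasting.yaml",       # hypothetical dataset config (8:2 train/test split)
    epochs=200,
    batch=16,
    optimizer="SGD",
    lr0=0.01,                   # initial learning rate
    momentum=0.9,
    weight_decay=0.0005,
)
```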

4.3. Evaluation Metrics

To evaluate the performance of the proposed model, multiple metrics were employed, including Precision (P), Recall (R) [28], and Mean Average Precision (mAP) [29], to assess explosive detection at blasting sites. Specifically, P, R, and mAP are defined as follows:
$$Precision = \frac{TP}{TP + FP}$$
$$Recall = \frac{TP}{TP + FN}$$
$$mAP = \frac{\sum_{i=1}^{N} AP_i}{N}$$
Here, TP represents the number of explosive objects correctly detected by the network model; FP denotes the number of false positive predictions; FN indicates the number of undetected explosives; N is the total number of classes; and AP represents the average precision for a single detection class, defined as:
$$AP = \sum_{i=1}^{n-1} \left(r_{i+1} - r_i\right) p_{\mathrm{interp}}\left(r_{i+1}\right)$$
In the above formulation, all predicted bounding boxes are ranked in descending order according to their confidence scores, and precision is computed at each position. The interpolated precision values are then weighted by the recall increments and summed to obtain AP, which is averaged over all classes to yield the mean average precision (mAP).
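As a concrete reference, the sketch below computes the all-point interpolated AP from a precision-recall sequence with NumPy; mAP is then the mean of the per-class AP values. The recall and precision values in the example are illustrative only.

```python
import numpy as np

def average_precision(recall, precision):
    """All-point interpolated AP: area under the precision-recall curve."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically non-increasing (interpolation step).
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum (r_{i+1} - r_i) * p_interp(r_{i+1}) wherever recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

if __name__ == "__main__":
    rec = np.array([0.2, 0.4, 0.4, 0.8])    # illustrative, sorted by confidence
    prec = np.array([1.0, 1.0, 0.67, 0.75])
    print(average_precision(rec, prec))      # 0.7
```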

4.4. Result Analysis

To comprehensively validate the target detection capability of the CRS-Y model in underwater blasting construction sites, comparative experiments were conducted under identical conditions using several state-of-the-art models.

4.4.1. Comparison of Different Loss Functions

To evaluate the impact of different loss functions on model performance, experiments were conducted considering the significant variations in aspect ratios of the detonator category within the dataset. The results demonstrate the effectiveness of the Inner-IoU loss. As shown in Figure 10, adopting the Inner-IoU loss allows the model to more accurately capture the elongated foot line of detonators. Furthermore, from the comparison in Table 1, it is evident that using the Inner-IoU loss not only improves the detection accuracy but also significantly enhances bounding box precision, with an increase of 3.6% in mAP0.5, fully validating the effectiveness of incorporating the Inner-IoU loss function.

4.4.2. Comparison of Different Attention Mechanisms

To highlight the compatibility of the proposed re-parameterized structure with the EMA attention mechanism, experiments were conducted on the dataset collected from underwater blasting construction sites using several state-of-the-art attention mechanisms. To visually demonstrate the model’s focus on target regions, heatmaps of the modified model layers were generated (see Figure 11). The heatmaps clearly show that the EMA attention mechanism can more effectively concentrate on the regions corresponding to detection targets. Additionally, different attention mechanisms led to significant variations in the model’s training performance (see Table 2).
The results indicate that replacing the SE attention mechanism in certain C3K-RVB modules with EMA not only significantly improved the Recall and mAP0.5, but also achieved the lowest computational cost among all compared methods, with GFLOPs reduced to 6.6, effectively lowering model complexity.

4.4.3. Comparison of Different Detection Algorithms

To validate the superiority of the proposed CRS-Y model over existing models on the custom dataset, comprehensive comparisons were performed under identical conditions with YOLOv5, YOLOv8, YOLOX [38], Faster R-CNN [39], and Mask R-CNN [40]. The results are summarized in Table 3.
The experimental results indicate that the CRS-Y model achieved 43.2% on the mAP0.5–0.95 metric, which was significantly higher than YOLOv5 (31.9%), YOLOv8 (33.5%), and YOLOX (33.8%), and also outperformed Faster R-CNN (30.1%) and Mask R-CNN (31.7%). In terms of recall, CRS-Y reached 59.7%, showing substantial improvement over YOLOv5 (44.2%), YOLOv8 (48.1%), YOLOX (48.7%), Faster R-CNN (43.5%), and Mask R-CNN (44.2%). Regarding model parameters, CRS-Y requires only 2.2 M, which is slightly higher than YOLOv5 (1.9 M) but lower than YOLOv8 (3.2 M), YOLOX (11.2 M), Faster R-CNN (14.3 M), and Mask R-CNN (16.2 M). These results indicate that CRS-Y significantly improves detection accuracy and recall while maintaining a lightweight architecture, balancing inference speed and deployment efficiency. Overall, CRS-Y outperforms other compared models across multiple performance metrics, making it particularly suitable for edge computing and resource-constrained scenarios.
To provide a more intuitive demonstration of CRS-Y’s adaptability for target detection in underwater blasting construction sites, visual comparisons across multiple detection models were conducted, as shown in Figure 12. In Figure 12, YOLOv5 and YOLOv8 exhibited missed detections, with limited recognition of small detonators. In some cases, YOLOv5 even failed to detect overlapping personnel. Faster R-CNN showed bounding box misalignment and overlapping issues when detecting personnel, suggesting that its non-maximum suppression mechanism is insufficient for underwater blasting scenarios. Although YOLOX, which adopts an anchor-free design with a decoupled detection head, demonstrated strong detection accuracy, it still produced occasional false detections when applied to underwater blasting target detection. In contrast, CRS-Y could accurately capture small targets such as detonators as well as large targets like explosive vehicles, demonstrating its ability to precisely detect multi-scale objects in underwater blasting construction sites.

4.4.4. Visual Comparison of the Model on Different Datasets

To more intuitively demonstrate the improvements of the CRS-Y model, the training results on different datasets were analyzed, and detection results from selected test images were examined. Figure 13 illustrates the trend of mAP0.5 for the CRS-Y model across different datasets, while Figure 14 presents the corresponding detection results.
As shown in Figure 13, under the same training conditions, the CRS-Y model demonstrated excellent performance across the self-collected dataset, URPC2020, and COCO128 datasets, indicating strong generalization capability. Specifically, as observed in subfigure (a), the CRS-Y model rapidly learns effective feature representations during the early stages of training on the self-collected dataset. Subfigures (b) and (c) show that the model achieves substantial improvements in detection accuracy on the public datasets.
As shown in Figure 14, the baseline models exhibited relatively low accuracy, with suboptimal detection performance and instances of both missed and incorrect detections. In contrast, the CRS-Y model demonstrated more accurate detection results, effectively addressing the challenges of target detection in underwater blasting construction sites. Furthermore, the model also shows advantages on public datasets, indicating that CRS-Y possesses strong robustness and generalization capability.

4.4.5. Ablation Study

To validate the effectiveness of the proposed enhancements, a series of ablation experiments were conducted on a large-scale waterway blasting excavation dataset. The results are summarized in Table 4, where A, B, and C represent the introduction of the Inner-IoU loss function, C3K-RVB module, and SPD-DSConv convolution module, respectively.
The results of multi-module combinations further demonstrated synergistic effects. In experiment 5, the combination of Inner-IoU loss and the C3K-RVB module increased mAP0.5 to 60.8% and mAP0.5–0.95 to 36.6%, showing moderate improvement compared to individual modules.
In contrast, experiment 6, which integrates all three enhancement modules, achieved the best performance, with mAP0.5 reaching 66.6% (+8.9%) and mAP0.5–0.95 reaching 43.2% (+9.7%). These results confirm the complementary and cumulative advantages of the proposed modules in improving detection precision and recall. In summary, the modular design of CRS-Y significantly enhances its detection capability in complex underwater blasting scenarios, balancing accuracy and computational efficiency, and fulfilling practical requirements for target detection in real-world underwater blasting construction sites.

5. Conclusions and Future Work

Targeting the issues of detection accuracy and multi-scale recognition for initiating explosive materials in the complex environment of channel blasting excavation sites in large-scale water transportation projects, the CRS-Y model proposed in this study has been optimized and improved in multiple aspects. The main conclusions are as follows:
  • The introduction of the Inner-IoU loss function effectively enhances the target localization accuracy, especially demonstrating outstanding performance in the regression of targets with irregular shapes and varying scales. Through its auxiliary-box scaling mechanism, this loss function strengthens the model’s capability in detecting hard and small samples.
  • The combination of the C3K-RepViT structure and the EMA mechanism significantly improves the model’s feature extraction and spatial feature expression abilities, endowing it with high detection accuracy in both complex backgrounds and multi-scale scenarios.
  • The proposed SPD-DSConv structure reduces the computational cost of the P2 feature layer while realizing an integrated design of downsampling and channel expansion. It enhances the joint expression capability of multi-dimensional features while ensuring the model’s lightweight property.
  • The collaborative integration of multiple modules achieved an excellent balance among accuracy, recall rate, and computational complexity. Consequently, the CRS-Y model outperforms mainstream detection models on multiple public datasets and the self-built dataset, exhibiting strong generalization ability and practical deployment value.
Although the CRS-Y model achieved substantial improvements in overall performance, several limitations remain. First, at the data level, the training and validation samples were primarily collected from the construction scenarios of the Western Land–Sea New Corridor (Pinglu) Canal Project. While these scenarios are representative to some extent, they are still geographically constrained in terms of geological structure and operational conditions. Consequently, the generalization capability of the model across regions with diverse geological formations and hydrological conditions requires further validation. Second, the model still exhibits sensitivity to environmental factors. Under extreme conditions such as highly turbid water or severely degraded illumination, the quality of visual information deteriorates significantly, which may lead to noticeable degradation in detection performance.
To address these limitations, future research will focus on the following directions:
  • Model Optimization and Efficiency Enhancement: Further streamlining and optimizing the network architecture to improve inference speed and energy efficiency on resource-constrained edge devices, while maintaining detection accuracy;
  • Advanced Compression and Acceleration Techniques: Investigating efficient model compression and inference acceleration strategies to achieve a better balance between high detection accuracy and lightweight deployment;
  • Multimodal and Spatiotemporal Information Fusion: Integrating multimodal sensory information with spatiotemporal feature modeling to enhance robustness and generalization in complex and dynamic underwater environments, thereby meeting the stringent real-time and reliability requirements of large-scale waterway blasting operations.

Author Contributions

Conceptualization, L.L.; Methodology, L.L.; Investigation, H.G.; Writing—original draft, X.H.; Writing—review and editing, Y.Z.; Supervision, C.M.; Project administration, L.L.; Funding acquisition, L.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Hubei Key Laboratory of Blasting Engineering Foundation under Grant No. HKLBEF202009.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset from this study is available at https://osf.io/h34q6/overview?view_only=602effb080fc466e8015fcd1b794a11a (accessed on 26 December 2025); the public dataset (URPC2020) can be obtained at https://universe.roboflow.com/urpc2020/urpc2020 (accessed on 26 December 2025).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wang, X.G. Blasting Design and Construction; Metallurgical Industry Press: Beijing, China, 2014. [Google Scholar]
  2. Ye, Z.Y.; Liang, H.L.; Lan, C.D. Application of YOLOv5s algorithm model for underwater target detection. TV Technol. 2023, 47, 39–43. [Google Scholar]
  3. Zhao, L.; Yun, Q.; Yuan, F.; Ren, X.; Jin, J.; Zhu, X. YOLOv7-CHS: An emerging model for underwater object detection. J. Mar. Sci. Eng. 2023, 11, 1949. [Google Scholar] [CrossRef]
  4. Lei, F.; Tang, F.; Li, S. Underwater target detection algorithm based on improved YOLOv5. J. Mar. Sci. Eng. 2022, 10, 310. [Google Scholar] [CrossRef]
  5. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
  6. Edozie, E.; Shuaibu, A.N.; John, U.K.; Sadiq, B.O. Comprehensive Review of Recent Developments in Visual Object Detection Based on Deep Learning. Artif. Intell. Rev. 2025, 58, 277. [Google Scholar] [CrossRef]
  7. Malagoli, E.; Di Persio, L. 2D Object Detection: A Survey. Mathematics 2025, 13, 893. [Google Scholar] [CrossRef]
  8. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE CVPR, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  9. Girshick, R. Fast R-CNN. In Proceedings of the IEEE ICCV, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar] [CrossRef]
  10. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE CVPR, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  11. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot Multibox Detector. In Proceedings of the ECCV, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  12. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE ICCV, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  13. Rong, X.L.; Xie, A.Q.; Zhao, P.; Chen, W.; Wang, B.X. Design of intelligent maintenance system for concrete beams based on machine vision. China J. Highw. Transp. 2025, 38, 307–317. [Google Scholar]
  14. Peng, L.P.; Zhao, B.T. Vertical shaft guide surface defect detection model based on MELE-YOLOv11n. J. Hubei Minzu Univ. (Nat. Sci. Ed.) 2025, 43, 376–381. [Google Scholar]
  15. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  16. Guo, C.; Fan, B.; Zhang, Q.; Xiang, S.; Pan, C. AugFPN: Improving Multi-scale Feature Learning for Object Detection. arXiv 2019, arXiv:1912.05384. [Google Scholar]
  17. He, W.; Zhang, Y.; Chen, J. Multi-Scale Residual Aggregation Feature Pyramid Network (MSRA-FPN). Electronics 2022, 12, 93. [Google Scholar] [CrossRef]
  18. Du, S.; Zhou, L.; Zhang, H. ASC-YOLO: Multi-Scale Feature Fusion and Adaptive Decoupled Head for Fracture Detection in Medical Imaging. Appl. Sci. 2025, 15, 9031. [Google Scholar] [CrossRef]
  19. Xun, J.; Li, Q.; Zhou, K. An Efficient Algorithm for Pedestrian Fall Detection Based on Lightweight YOLO and Adaptive Binary Cross-Entropy. Sci. Rep. 2025, 15, 9036. [Google Scholar] [CrossRef]
  20. Jing, C.L. Research and Implementation of Lightweight Network Embedded Machine Vision System for Traffic Monitoring. Master’s Thesis, Qilu University of Technology, Jinan, China, 2024. [Google Scholar]
  21. Wang, A.; Chen, H.; Lin, Z.J.; Han, J.D.; Ding, G.G. RepViT: Revisiting Mobile CNN from ViT Perspective. arXiv 2023, arXiv:2307.09283. [Google Scholar]
  22. Hu, Y.; Chen, Y.; Li, X.; Feng, J. Dynamic Feature Fusion for Semantic Edge Detection. arXiv 2019, arXiv:1902.09104. [Google Scholar] [CrossRef]
  23. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar] [CrossRef]
  24. Ouyang, D.L.; He, S.; Zhang, G.Z.; Luo, M.; Guo, H.; Zhan, J. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece, 4–10 June 2023. [Google Scholar]
  25. Rozhbayani, G.; Tuama, A.; Al-Azzo, E. Social Distancing Monitoring by Human Detection Through Bird’s-Eye View Technique. In Proceedings of the VISIGRAPP: VISAPP, Rome, Italy, 27–29 February 2024. [Google Scholar]
  26. Chen, L.; Yu, Z.; Yang, J. SPD-CNN: A plain CNN-based model using the symmetric positive definite (SPD) matrix. Front. Neurorobot. 2022, 16, 958052. [Google Scholar] [CrossRef] [PubMed]
  27. Zhang, H.; Xu, C.; Zhang, S.J. Inner-IoU: More Effective Intersection over Union Loss with Auxiliary Bounding Box. arXiv 2023, arXiv:2311.02877. [Google Scholar]
  28. Padilla, R.; Netto, S.L.; da Silva, E.A.B. A Survey on Performance Metrics for Object-Detection Algorithms. In Proceedings of the 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), Niterói, Brazil, 7–9 July 2020; pp. 237–242. [Google Scholar] [CrossRef]
  29. Zangana, M.; Zangana, H.M. Survey and Performance Analysis of Deep Learning Based Object Detection in Challenging Environments. Sensors 2021, 21, 5116. [Google Scholar] [CrossRef]
  30. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  31. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020. [Google Scholar]
  32. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing Geometric Factors in Model Learning and Inference for Object Detection and Instance Segmentation. arXiv 2020, arXiv:2005.03572. [Google Scholar] [CrossRef]
  33. Zhang, Y.-F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and Efficient IoU Loss for Accurate Bounding Box Regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  34. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  35. Gu, R.; Wang, G.; Song, T.; Huang, R.; Aertsen, M.; Deprest, J.; Ourselin, S.; Vercauteren, T. CA-Net: Comprehensive Attention Convolutional Neural Networks for Explainable Medical Image Segmentation. arXiv 2020, arXiv:2009.10549. [Google Scholar] [CrossRef]
  36. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11634–11642. [Google Scholar]
  37. Wang, C.Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–24 June 2023. [Google Scholar]
  38. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
  39. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  40. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
Figure 1. Underwater blasting construction site. (a) The construction site of earthwork blasting on land; (b) underwater reef blasting construction site.
Figure 2. Architecture of the YOLOv11 model.
Figure 3. Overall architecture of the CRS-Y model.
Figure 4. Architecture of the C3K-RVB structure.
Figure 5. Structure of the RepViT module.
Figure 6. EMA attention mechanism.
Figure 7. Structure of depthwise separable convolution.
Figure 8. Structure of SPDConv. (a) Original input feature map; (b) Slicing process (Sub-sampling); (c) Mapping from space to depth; (d) Channel concatenation; (e) Convolution with a stride of 1.
Figure 9. Schematic diagram of Inner-IoU.
Figure 10. Detection heatmaps using different loss functions. ① Original image; ② Inner loss function; ③ CIOU loss function; ④ IOU loss function; ⑤ EIOU loss function; ⑥ GIOU loss function.
Figure 11. Detection heatmaps using different attention mechanisms. ① Original image; ② EMA attention mechanism; ③ SE attention mechanism; ④ CBAM attention mechanism; ⑤ CA attention mechanism; ⑥ ECA attention mechanism.
Figure 12. Visual comparison of multiple models for on-site detection. (a) The construction site of earthwork blasting on land; (b) Underwater reef blasting construction site; ① Original image; ② YOLOv5 model; ③ YOLOv8 model; ④ Faster R-CNN model; ⑤ YOLOX model; ⑥ CRS-Y model.
Figure 13. Comparison of training mAP0.5 for the CRS-Y model across different datasets. (a) MyDATA; (b) URPC2020; (c) COCO128.
Figure 14. Comparison of detection results across different datasets.
Table 1. Experimental comparison of different loss functions.

Model | R/% | mAP0.5/% | GFLOPs
IOU [30] | 50.4 | 57.7 | 6.4
CIOU [31] | 51.7 | 57.9 | 6.4
EIOU [32] | 51.1 | 58.8 | 6.4
GIOU [33] | 51.4 | 56.7 | 6.4
Inner-IoU | 55.7 | 61.3 | 6.4
Table 2. Experimental comparison of different attention mechanisms.

Model | R/% | mAP0.5/% | GFLOPs
CBAM [34] | 52.1 | 60.6 | 7.5
CA [35] | 50.6 | 59.8 | 7.0
ECA [36] | 52.4 | 54.1 | 7.3
SE | 51.2 | 56.9 | 6.8
EMA | 56.8 | 62.8 | 6.6
Table 3. Experimental comparison of different detection algorithms.

Model | mAP0.5–0.95/% | Recall/% | Params/M
YOLOv5 | 31.9 | 44.2 | 1.9
YOLOv8 | 33.5 | 48.1 | 3.2
YOLOX | 33.8 | 48.7 | 11.2
Faster R-CNN | 30.1 | 43.5 | 14.3
Mask R-CNN | 31.7 | 44.2 | 16.2
CRS-Y | 43.2 | 59.7 | 2.2
Table 4. Ablation study results.

Number | A | B | C | GFLOPs | mAP0.5/% | mAP0.5:0.95/%
1 |  |  |  | 6.4 | 57.7 | 33.5
2 | √ |  |  | 6.4 | 61.3 | 33.6
3 |  |  |  | 5.9 | 62.8 | 37.6
4 |  |  |  | 6.2 | 60.2 | 35.2
5 | √ | √ |  | 6.1 | 60.8 | 36.6
6 | √ | √ | √ | 6.6 | 66.6 | 43.2
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
