1. Introduction
Mining activities worldwide can induce ground deformation and surface cracks, which threaten ecological systems, infrastructure stability, and mine safety. In coal-producing regions, including but not limited to western China, large-scale and high-intensity extraction makes timely crack monitoring especially important [1,2,3,4].
Traditionally, surface crack monitoring in mining areas has relied on manual surveys, interferometric synthetic aperture radar (InSAR), and satellite remote sensing interpretation [5,6]. However, the spatial resolution of InSAR and satellite imagery is often insufficient for detecting narrow and discontinuous cracks, while manual surveys are inefficient for large-scale applications. Moreover, complex terrain and environmental variability further reduce monitoring reliability [7]. As a result, timely and accurate crack detection remains challenging, limiting the effectiveness of geological hazard assessment and early warning [8]. Therefore, developing high-resolution and automated crack detection methods remains an urgent research priority.
With the rapid development of unmanned aerial vehicle (UAV) technology, low-altitude remote sensing has provided new opportunities for geological hazard monitoring in mining areas [9]. Owing to their flexible deployment, high operational efficiency, and capability to acquire high-resolution imagery, UAVs have significantly improved the spatial accuracy and monitoring efficiency of surface crack detection [10,11]. Nevertheless, the processing and interpretation of UAV imagery remain challenging, particularly in the accurate extraction of crack information from large volumes of high-resolution images [12]. Consequently, developing efficient and robust crack detection methods from UAV imagery has become a key research focus in mining geological hazard monitoring.
With the rapid development of deep learning, intelligent surface crack detection has become an effective approach for geological hazard monitoring in mining areas [13]. Although deep learning models have achieved high accuracy in crack detection by learning complex image features [14], mining-induced surface cracks are characterized by elongated morphologies, large variations in scale, and complex background interference [15,16]. As a result, existing methods still face three key challenges: insufficient extraction of elongated and discontinuous crack features, vulnerability to background interference, and inaccurate localization of small and slender targets. Therefore, developing a robust crack detection method for complex mining environments remains crucial for improving monitoring accuracy and efficiency.
Through extensive evaluation of convolutional structures, attention mechanisms, and loss functions in the You Only Look Once (YOLO) series [17], the best performance among the tested configurations was achieved by integrating three modules. The main contributions are as follows:
A switchable atrous convolution (SAConv) module is introduced to enhance the multi-scale representation of mining-induced surface cracks.
A cascaded group attention (CGA) mechanism is incorporated to improve feature discrimination under complex backgrounds.
Shape-IoU loss is adopted to improve localization accuracy for slender and irregular cracks.
A UAV-based dataset of 5000 annotated crack images is constructed for performance evaluation in mining subsidence areas.
3. Proposed YOLO11n Network Architecture
Compared with YOLOv8, YOLO11 introduces improvements mainly in feature extraction and fusion while retaining a similar detection head for framework compatibility. Specifically, the backbone replaces the C2f module with the more efficient C3k2 module within the CSPDarknet architecture [40]. In the neck, YOLO11 maintains the PANet-based feature pyramid and further incorporates C3k2 modules to enhance multi-scale feature interaction and representation, thereby improving detection performance, particularly for small targets.
YOLO11n was selected as the baseline because it provides a compact one-stage detector with improved C3k2 feature extraction and PAN-FPN feature fusion while retaining high inference speed and straightforward deployment. Compared with heavier transformer-based detectors, RT-DETR-like frameworks, EfficientDet-style compound-scaled detectors, or segmentation networks, YOLO11n offers a practical balance between accuracy, training stability, and deployment simplicity for UAV crack inspection. Therefore, the enhanced model was built on YOLO11n to test whether crack-specific feature modules could improve detection while keeping the workflow compatible with lightweight YOLO deployment pipelines.
Based on YOLO11n, this study proposes an enhanced framework for surface crack detection in coal-mining subsidence areas. To address the challenges of multi-scale variation, slender morphology, and complex background interference, three improvements are introduced: an SAConv module, a CGA mechanism, and the Shape-IoU loss function. The detection head retains the decoupled design of YOLO11, where classification and regression are performed separately to better exploit semantic and spatial information, while an optimized channel allocation strategy improves computational efficiency. The overall architecture of the proposed network is shown in Figure 1.
The integration strategy is as follows: SAConv is embedded into selected C3k2 blocks in the backbone to form C3k2_SAConv, so that low- and middle-level feature maps can capture elongated crack continuity with adaptive receptive fields. CGA is inserted into the neck after multi-scale feature fusion, where P3, P4, and P5 feature levels carry small, medium, and context-rich crack information. The detection head remains decoupled and unchanged to preserve YOLO11n compatibility. This design limits architectural changes to the feature-extraction and feature-fusion stages, which are most relevant to discontinuous crack representation and background suppression.
3.1. Loss Function Shape-IoU Improvement
YOLO11 employs the CIoU loss function for bounding box regression. However, CIoU inadequately captures the shape and scale characteristics of small targets, leading to reduced localization accuracy in small-object detection scenarios [41]. To address this issue, Shape-IoU, originally proposed as a shape- and scale-aware bounding-box regression metric, introduces shape-aware constraints by decoupling geometric properties from spatial location information, thereby improving regression stability and accuracy for slender crack targets [42].
Compared with CIoU, DIoU, EIoU, and SIoU, Shape-IoU explicitly considers object shape and scale when measuring bounding-box regression errors. This is beneficial for mining cracks because their bounding boxes are often highly elongated and sensitive to small localization deviations along the narrow direction. In such cases, a loss function that penalizes shape inconsistency can improve localization stability even when the intersection area changes only slightly.
Here, L_d and L_Ω denote the shape distance loss and shape value loss, respectively. By explicitly modeling shape and scale information, Shape-IoU improves localization stability for slender crack targets.
Here, W and H are the horizontal and vertical weighting factors, respectively; w^gt and h^gt denote the width and height of the ground-truth box; and s is a scale factor related to target size.
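For reference, the Shape-IoU terms can be written out as below. This is a sketch that assumes the formulation of the original Shape-IoU proposal [42] is used unchanged. The symbols (x_c, y_c) for the box center, c for the diagonal of the smallest box enclosing both boxes, ω_w and ω_h for the weighted relative size errors, θ = 4, and the 0.5 weighting are taken from that formulation and are assumptions with respect to this paper.

```latex
\begin{aligned}
W &= \frac{2\,(w^{gt})^{s}}{(w^{gt})^{s} + (h^{gt})^{s}}, \qquad
H = \frac{2\,(h^{gt})^{s}}{(w^{gt})^{s} + (h^{gt})^{s}} \\
L_{d} &= H\,\frac{(x_c - x_c^{gt})^{2}}{c^{2}} + W\,\frac{(y_c - y_c^{gt})^{2}}{c^{2}} \\
\omega_w &= H\,\frac{|w - w^{gt}|}{\max(w,\, w^{gt})}, \qquad
\omega_h = W\,\frac{|h - h^{gt}|}{\max(h,\, h^{gt})} \\
L_{\Omega} &= \sum_{t \in \{w,\,h\}} \left(1 - e^{-\omega_t}\right)^{\theta}, \qquad \theta = 4 \\
L_{\text{Shape-IoU}} &= 1 - \mathrm{IoU} + L_{d} + 0.5\,L_{\Omega}
\end{aligned}
```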
The structure of Shape-IoU is shown in Figure 2. By introducing shape-aware weighting and directional distance penalties, Shape-IoU provides a more accurate representation of geometric discrepancies between predicted and ground-truth boxes, thereby improving localization performance for slender crack targets [43].
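To make the behavior of the loss on elongated boxes concrete, the following is a minimal single-box sketch in plain Python. It assumes the standard Shape-IoU formulation of [42], with boxes given as (center x, center y, width, height). The function name `shape_iou_loss`, the default `scale=1.0`, and `eps` are illustrative choices and are not taken from this paper.

```python
import math


def shape_iou_loss(pred, gt, scale=1.0, theta=4.0, eps=1e-7):
    """Shape-IoU loss for one predicted box and one ground-truth box.

    Boxes are (x_center, y_center, width, height). `scale` plays the role of
    the scale factor s; its value here is an assumption.
    """
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt

    # Corner coordinates of both boxes.
    p_x1, p_x2, p_y1, p_y2 = px - pw / 2, px + pw / 2, py - ph / 2, py + ph / 2
    g_x1, g_x2, g_y1, g_y2 = gx - gw / 2, gx + gw / 2, gy - gh / 2, gy + gh / 2

    # Plain IoU.
    inter_w = max(0.0, min(p_x2, g_x2) - max(p_x1, g_x1))
    inter_h = max(0.0, min(p_y2, g_y2) - max(p_y1, g_y1))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter + eps
    iou = inter / union

    # Weighting factors W and H computed from the ground-truth shape.
    w_factor = 2 * gw ** scale / (gw ** scale + gh ** scale)
    h_factor = 2 * gh ** scale / (gw ** scale + gh ** scale)

    # Shape distance term, normalized by the enclosing-box diagonal squared.
    cw = max(p_x2, g_x2) - min(p_x1, g_x1)
    ch = max(p_y2, g_y2) - min(p_y1, g_y1)
    c2 = cw ** 2 + ch ** 2 + eps
    l_d = (h_factor * (px - gx) ** 2 + w_factor * (py - gy) ** 2) / c2

    # Shape value term from weighted relative width/height errors.
    omega_w = h_factor * abs(pw - gw) / max(pw, gw)
    omega_h = w_factor * abs(ph - gh) / max(ph, gh)
    l_omega = (1 - math.exp(-omega_w)) ** theta + (1 - math.exp(-omega_h)) ** theta

    return 1 - iou + l_d + 0.5 * l_omega


# A wide, thin ground-truth box, similar in shape to a horizontal crack.
gt_box = (50.0, 50.0, 40.0, 4.0)
loss_shift_x = shape_iou_loss((51.0, 50.0, 40.0, 4.0), gt_box)  # shift along the crack
loss_shift_y = shape_iou_loss((50.0, 51.0, 40.0, 4.0), gt_box)  # shift across the crack
```

For this thin box, the one-pixel shift across the narrow direction yields a clearly larger loss than the same shift along the long direction, which matches the sensitivity described above.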
3.2. Attention Mechanism Improvements
To alleviate attention redundancy in conventional multi-head self-attention (MHSA), the cascaded group attention (CGA) module, derived from EfficientViT [44], is integrated into the neck network. CGA performs attention computation in grouped channel subspaces and progressively aggregates inter-group information through cascaded attention operations, thereby enhancing feature representation for slender and discontinuous crack structures.
Here, Xj and Yj denote the input and output of the j-th attention head, respectively.
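In this notation, the cascade can be sketched as below, assuming the CGA formulation of EfficientViT [44] is used without modification. The symbols h (number of heads), X'_j (the head input after adding the previous head's output), the projection matrices W_j^Q, W_j^K, W_j^V, and the output projection W^P are introduced for illustration.

```latex
\begin{aligned}
X'_j &= \begin{cases} X_j, & j = 1 \\ X_j + Y_{j-1}, & 1 < j \le h \end{cases} \\
Y_j &= \mathrm{Attn}\!\left(X'_j W_j^{Q},\; X'_j W_j^{K},\; X'_j W_j^{V}\right) \\
Y &= \mathrm{Concat}\left[Y_1, \dots, Y_h\right] W^{P}
\end{aligned}
```

Each head receives a different channel split X_j of the input, and adding Y_{j-1} to the next head's input is what lets later heads refine the features produced by earlier ones.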
By enhancing information interaction among attention heads, CGA improves feature discrimination and suppresses background interference with limited computational cost [45]. The architecture is shown in Figure 3.
3.3. Convolution Enhancement
Because surface cracks exhibit elongated and discontinuous structures, adaptive receptive fields are beneficial for capturing long-range contextual information. Therefore, SAConv is incorporated into the backbone network to replace selected standard convolutions. By combining standard and atrous convolutions, SAConv dynamically adjusts the receptive field for multi-scale feature extraction [46].
Atrous convolution is suitable for slender crack extraction because it enlarges the receptive field without reducing feature-map resolution, helping the network connect discontinuous crack segments and capture long-range linear context [47]. Compared with deformable convolution, SAConv provides a more controlled receptive-field expansion and lower geometric instability for thin structures whose boundaries are weak and fragmented. Deformable convolution is powerful for irregular object shapes, but its learned offsets may be influenced by vegetation, shadows, and soil textures in UAV scenes. Therefore, SAConv was selected to strengthen context aggregation while preserving stable crack geometry.
Here, r denotes the dilation rate, Δw is the learnable weight offset, and s(x) is the adaptive switching function.
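With these symbols, the switching operation can be sketched as below, assuming the switchable atrous convolution of [46] is used as originally defined. Conv(x, w, r) denotes a convolution of input x with weights w and dilation rate r, and y is the output feature; both are notation introduced here for illustration.

```latex
y = s(x)\cdot \mathrm{Conv}(x,\, w,\, 1) \;+\; \bigl(1 - s(x)\bigr)\cdot \mathrm{Conv}(x,\, w + \Delta w,\, r)
```

Where s(x) is close to 1 the standard convolution dominates, and where it is close to 0 the dilated branch with the enlarged receptive field dominates, so the receptive field is selected per spatial location.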
SAConv adaptively adjusts receptive fields through dynamic dilation selection, enabling more effective extraction of elongated crack features and improved robustness to scale variations while maintaining computational efficiency. In this study, SAConv is integrated into the C3k2 module to form C3k2_SAConv, as illustrated in Figure 4.
4. Experimental Results and Analysis
4.1. Research Region Overview
Zhungeer Banner is located in the eastern part of southwestern Inner Mongolia and is bordered by the Yellow River on three sides. It is commonly referred to as “Jiming San Sheng” [48]. The Zhungeer Banner coalfield is situated within the arid and semi-arid region of northern China, where the ecological environment is fragile, and water resources are limited. Continuous large-scale coal mining has induced extensive surface subsidence, significant declines in groundwater levels, and increasingly severe ecological and environmental degradation. The study area is located in the northeastern part of Zhungeer Banner, covering approximately 6.0 km² northeast of the Bulian Gou Coal Mine.
The major forms of ground subsidence in the mining area include surface cracks, subsidence trenches, and collapse pits. Among these, surface cracks are the most extensively developed [49], with widths ranging from 10 to 120 cm. These cracks commonly exhibit a parallel and stepped distribution pattern, with step heights generally varying between 15 and 130 cm and reaching a maximum of 2.5 m. In plan view, the cracks are predominantly curvilinear, although locally linear features are also observed.
4.2. Experimental Dataset
This study employed a Pegasus D20 unmanned aerial vehicle (UAV) (Feima Robotics Co., Ltd., Shenzhen, China) equipped with a D-OP3000 five-lens oblique photogrammetry system (Feima Robotics Co., Ltd., Shenzhen, China) for data acquisition. UAV flights were conducted between 09:00 and 15:00 to minimize the influence of shadows on aerial imagery. The flight altitude was maintained at 400 m above the take-off point, resulting in a ground sampling distance (GSD) of 5–7 cm across the survey area. The side overlap and forward overlap were set to 65% and 80%, respectively.
Each flight mission lasted less than 50 min, and a total of four flights were completed, covering an aerial survey area of approximately 6.071 km². The UAV flight scheme is illustrated in Figure 5.
The dataset was constructed using UAV-acquired remote sensing imagery. To enhance model performance and improve data diversity, data augmentation was applied to the calibrated samples by adjusting brightness and contrast with scaling factors of 0.25, 0.5, and 0.75, resulting in a dataset of 5000 surface crack images.
Color-space augmentation was selected because illumination variation, shadows, exposed soil brightness, and vegetation-background contrast are major sources of uncertainty in UAV images from the study area. Geometric transformations were not used in this study because crack orientation, continuity, and scale are directly related to the physical morphology of mining-induced ground fissures; aggressive rotation, scaling, or warping may introduce samples that are less consistent with the photogrammetric scene geometry. Nevertheless, moderate geometric augmentation may further improve dataset balance and model robustness, and this will be investigated in future work.
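The brightness and contrast scaling described above can be illustrated on a flat list of 8-bit intensities. This is only a sketch: the exact augmentation formulas are not specified in the text, so the multiplicative brightness rule, the contrast scaling about the image mean, and the separate treatment of the two adjustments are all assumptions, and the function names are hypothetical.

```python
FACTORS = (0.25, 0.5, 0.75)  # scaling factors reported in the text


def _clip(value):
    """Round to the nearest integer and clip to the 8-bit range [0, 255]."""
    return min(255, max(0, round(value)))


def scale_brightness(pixels, factor):
    """Assumed rule: multiply every intensity by `factor`."""
    return [_clip(p * factor) for p in pixels]


def scale_contrast(pixels, factor):
    """Assumed rule: scale each intensity's distance from the image mean."""
    mean = sum(pixels) / len(pixels)
    return [_clip(mean + factor * (p - mean)) for p in pixels]


def augment(pixels):
    """Return one brightness variant and one contrast variant per factor."""
    variants = []
    for factor in FACTORS:
        variants.append(("brightness", factor, scale_brightness(pixels, factor)))
        variants.append(("contrast", factor, scale_contrast(pixels, factor)))
    return variants


sample = [0, 100, 200]  # toy intensities, not data from the study
variants = augment(sample)
```

Because both operations leave pixel positions unchanged, the crack annotations of the source image can be reused for every variant, which is consistent with the decision to avoid geometric transformations.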
After confirming that the image GSD met the requirements of the monitoring task, the dataset was randomly divided into training, validation, and test sets in a ratio of 7:2:1, corresponding to 3500 training images, 1000 validation images, and 500 test images. LabelMe (version 5.3.1; MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA) was used to annotate the cracks, and the annotations were exported as text files in which each surface crack is represented as a point-based coordinate sequence.
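The 7:2:1 division can be reproduced with a seeded shuffle, as in the sketch below. The function name `split_dataset` and the seed value are illustrative assumptions; the text does not state which seed or tool was used for the split.

```python
import random


def split_dataset(items, ratios=(7, 2, 1), seed=0):
    """Shuffle `items` reproducibly and cut them into train/val/test parts."""
    items = list(items)
    random.Random(seed).shuffle(items)  # local generator, global state untouched
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Image indices stand in for file names here.
train, val, test = split_dataset(range(5000))
print(len(train), len(val), len(test))  # 3500 1000 500
```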
To improve reproducibility, crack annotations were generated in LabelMe as point-based coordinate sequences and then converted to the training-label format required by the detector. The annotation guideline included visible continuous, discontinuous, curvilinear, and stepped fissures, while excluding roads, vegetation boundaries, shadows, and exposed-soil textures that did not correspond to crack morphology. Ambiguous samples were rechecked during label conversion and dataset cleaning. The verified quantitative descriptors comprise 5000 images divided into 3500 training, 1000 validation, and 500 test images, a ground sampling distance of 5–7 cm, field-observed crack widths of 10–120 cm, and brightness/contrast scaling factors of 0.25, 0.5, and 0.75. Detailed image-level crack-length distributions and instance-level size or class-imbalance statistics were not retained and are therefore reported as a limitation rather than estimated retrospectively.
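The conversion from point sequences to detector labels can be sketched as follows. The sketch assumes that the target format is the common YOLO text label, one line per object with a class index followed by the normalized box center, width, and height, and that each crack polygon is reduced to its axis-aligned bounding box. Both points are assumptions, and the function names are hypothetical.

```python
def polygon_to_yolo_box(points, img_w, img_h, class_id=0):
    """Reduce a list of (x, y) pixel points to a normalized bounding box."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    cx = (x_min + x_max) / 2 / img_w
    cy = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return class_id, cx, cy, w, h


def to_label_line(box):
    """Format one box as an assumed YOLO-style label line."""
    class_id, cx, cy, w, h = box
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# Toy crack trace in a 1000 x 1000 image (illustrative values only).
trace = [(100, 200), (300, 220), (500, 260)]
line = to_label_line(polygon_to_yolo_box(trace, 1000, 1000))
```

For a long diagonal crack, a single enclosing box contains mostly background, so splitting the trace into shorter segments before conversion may be preferable; whether that was done here is not stated.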
4.3. Experimental Environment and Parameter Configuration
The hardware–software configuration was as follows: Windows 10 Pro, PyTorch 2.0.1, Python 3.11.7, and CUDA 12.9, running on a PC (Dell Technologies Inc., Round Rock, TX, USA) with i9-10900X and NVIDIA RTX 3090. Using the YOLO11n framework, the proposed model was trained on 1080 × 1080 drone images with a batch size of 28 over 300 epochs, starting with a learning rate of 0.01 and momentum set to 0.9.
Training used stochastic gradient descent with scheduled learning-rate decay over the 300 epochs, and all models were initialized from YOLO11n pretrained weights. Five independent runs with different random seeds were used for statistical validation. No early stopping was applied, model selection was based on validation-set performance, and the test set was used only for the final comparison. To ensure a controlled comparison, all competing models used the same data split, augmentation settings, input size, initialization strategy, hardware/software environment, training budget, and evaluation pipeline.
4.4. Experimental Evaluation Criteria
The model was evaluated using precision, recall, mean average precision at an intersection-over-union threshold of 0.5 (mAP@0.5), number of parameters, floating-point operations (FLOPs), frames per second (FPS), and model size. Precision, recall, and mAP@0.5 were used to evaluate detection accuracy, whereas parameters, FLOPs, FPS, and model size were used to determine computational efficiency. FPS was measured on the same hardware platform using 1080 × 1080 input images, so the reported values reflect relative inference efficiency under identical test conditions.
Additional indicators were considered to support a more comprehensive interpretation. The F1-score is derived from precision and recall, while mAP@0.5:0.95 and confusion-matrix analysis are useful for stricter localization and error analysis. ROC analysis is less directly applicable to the one-stage object-detection setting because detections depend on confidence thresholds and non-maximum suppression. Of these indicators, only the F1-score is reported here; the others will be incorporated in future expanded evaluations.
The F1-score was calculated from the reported precision and recall values using F1 = 2PR/(P + R) and is included in the comparative results. The proposed model achieved an F1-score of 81.6% on the mining-area UAV dataset, compared with 78.4% for the baseline YOLO11n, and 78.4% on Crack500, compared with 74.4% for YOLO11n. The archived experimental summaries contain mAP@0.5 but do not contain the complete threshold-wise AP outputs required for a valid COCO-style mAP@0.5:0.95 calculation. Because mAP@0.5:0.95 cannot be reconstructed reliably from mAP@0.5 alone, it was not inferred or estimated retrospectively. Future evaluations will retain complete threshold-wise outputs and report this metric using the COCO-style protocol.
Precision and recall are defined in Equations (9) and (10) as P = TP/(TP + FP) and R = TP/(TP + FN), where TP denotes correctly detected crack targets, FP denotes background or non-crack regions incorrectly detected as cracks, and FN denotes missed crack targets. Precision evaluates the reliability of positive detections, whereas recall measures the ability to avoid missing actual cracks.
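The three accuracy indicators can be checked with a few lines of Python. The sketch below uses the standard definitions. The proposed-model precision and recall (85.6% and 77.9%) are the values reported in the ablation analysis, while the baseline values (84.2% and 73.3%) are not quoted directly in the text and are derived here by subtracting the reported gains of 1.4 and 4.6 points.

```python
def precision(tp, fp):
    """Fraction of detections that are true cracks."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of true cracks that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


proposed_f1 = f1_score(85.6, 77.9)  # reported P and R of the proposed model
baseline_f1 = f1_score(84.2, 73.3)  # derived baseline values (see lead-in)
print(round(proposed_f1, 1), round(baseline_f1, 1))  # 81.6 78.4
```

Both results agree with the F1-scores quoted for the mining-area UAV dataset.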
4.5. Ablation Experiment Results and Analysis
Ablation studies were conducted to assess the performance gains contributed by the proposed modules in UAV aerial object detection. The SAConv, CGA mechanism, and Shape-IoU loss function were progressively integrated into the baseline YOLO11n model. The results are summarized in Table 2.
A controlled comparison between Configuration D (SAConv + CGA with CIoU) and the proposed model (SAConv + CGA with Shape-IoU) isolates the effect of the regression loss while keeping the feature-extraction architecture fixed. Replacing CIoU with Shape-IoU increased recall from 76.8% to 77.9% (+1.1 percentage points) and mAP@0.5 from 83.9% to 84.3% (+0.4 percentage points), while precision changed from 86.2% to 85.6% (−0.6 percentage points). Thus, Shape-IoU provides a complementary improvement in sensitivity and overall localization performance for slender cracks, although its standalone contribution is modest and does not uniformly improve all metrics.
Configuration A: Replacing CIoU with Shape-IoU increased precision by 0.7 percentage points, while the model size, number of parameters, and FLOPs remained unchanged. However, recall and mAP@0.5 decreased by 1.1 and 0.2 percentage points, respectively, and FPS decreased by 4.08%.
Configuration B: Introducing the CGA mechanism increased recall and mAP@0.5 by 1.9 and 0.7 percentage points, respectively, with no change in model size or FLOPs. The parameter count was reduced by 0.03 M, while precision decreased by 1.5 percentage points and FPS decreased by 18.92%.
Configuration C: Incorporating the SAConv module improved recall and mAP@0.5 by 5.4 and 2.3 percentage points, respectively. However, it significantly increased computational complexity, with parameters, model size, and FLOPs rising by 9.59 M, 18.34 MB, and 28.4 G, respectively, while precision decreased by 0.8 percentage points and FPS dropped by 51.57%.
Configuration D: The combined use of SAConv and CGA improved precision, recall, and mAP@0.5 by 2.0, 3.5, and 2.8 percentage points, respectively. This gain came at the cost of increased computational load, with parameters, model size, and FLOPs increasing by 9.49 M, 18.14 MB, and 28.4 G, while FPS decreased by 55.57%. Relative to YOLO11n, the proposed model uses approximately 4.35 times as many parameters and 3.78 times as many FLOPs, but it still processes 57.2 frames per second on the RTX 3090. This operating point is appropriate for post-flight or semi-real-time UAV inspection, whereas the baseline remains preferable when edge-device latency and storage are the dominant constraints.
Proposed Model: By integrating SAConv, CGA, and Shape-IoU, the proposed model achieves improvements of 1.4, 4.6, and 3.2 percentage points in precision, recall, and mAP@0.5, respectively, compared with the baseline YOLO11n. In practical mining-crack monitoring, the 4.6-point recall improvement is particularly important because missed detections may delay field verification of active fissures and potential subsidence hazards. The improvement also indicates better continuity perception for slender and discontinuous cracks in complex UAV backgrounds. These gains are accompanied by increased computational cost, with the number of parameters increasing from 2.83 M to 12.32 M, FLOPs from 10.2 G to 38.6 G, model size from 5.76 MB to 23.9 MB, and FPS decreasing from 127.4 to 57.2. Therefore, the proposed model is more suitable for offline or semi-real-time UAV inspection workflows where improved crack recall and localization are prioritized over ultra-lightweight deployment.
Quantitatively, the FPS decrease from 127.4 to 57.2 corresponds to an approximate inference-latency increase from 7.85 ms to 17.48 ms per image on the same RTX 3090 platform. Thus, the proposed model is about 2.23 times slower than the baseline while improving mAP@0.5 by 3.2 percentage points and recall by 4.6 percentage points. Memory usage and energy consumption were not measured in this study; therefore, the computational-cost discussion is limited to parameters, FLOPs, model size, FPS, and derived latency.
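The derived cost figures follow directly from the reported parameters, FLOPs, and FPS, as the short check below shows. The dictionary layout and the helper name `latency_ms` are illustrative; the per-image latency is assumed to be the reciprocal of the FPS.

```python
def latency_ms(fps):
    """Per-image latency in milliseconds, assuming latency = 1 / FPS."""
    return 1000.0 / fps


# Values reported in the ablation analysis.
baseline = {"params_m": 2.83, "flops_g": 10.2, "fps": 127.4}
proposed = {"params_m": 12.32, "flops_g": 38.6, "fps": 57.2}

param_ratio = proposed["params_m"] / baseline["params_m"]
flops_ratio = proposed["flops_g"] / baseline["flops_g"]
slowdown = baseline["fps"] / proposed["fps"]

print(round(latency_ms(baseline["fps"]), 2))  # 7.85
print(round(latency_ms(proposed["fps"]), 2))  # 17.48
print(round(param_ratio, 2), round(flops_ratio, 2), round(slowdown, 2))  # 4.35 3.78 2.23
```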
Among the three updates, SAConv has the greatest effect on recall and mAP@0.5 because the adaptive receptive field improves the representation of elongated and discontinuous crack structures. However, it also introduces the largest increase in parameters and FLOPs. CGA mainly improves feature discrimination and background suppression with a moderate inference-speed cost, whereas Shape-IoU improves shape-aware localization but provides limited improvement when used alone. The final model therefore represents an accuracy–efficiency trade-off rather than a purely lightweight alternative to YOLO11n.
To further interpret the performance gains, Grad-CAM visualizations were generated for both YOLO11n and the proposed model, as shown in Figure 6. The purpose of Figure 6 is not to claim that all target cracks are invisible to the naked eye; rather, it illustrates that even visually identifiable cracks may be difficult for detectors under vegetation cover, shadows, soil texture interference, and discontinuous crack boundaries. The baseline YOLO11n exhibits dispersed activations and noticeable responses to vegetation and background textures. In contrast, the proposed model produces more concentrated activations along crack structures, indicating that the combination of SAConv and CGA enhances crack feature perception while suppressing irrelevant background interference.
Figure 7 presents feature maps of different network configurations. Compared with YOLO11n, SAConv enhances crack continuity representation, while the addition of CGA further suppresses background interference and strengthens crack feature discrimination. The visualization results agree with the quantitative improvements in Table 2, confirming the effectiveness of the proposed framework in complex mining environments.
4.6. Comparison Results and Analysis with Other Algorithms
To evaluate the performance of the proposed model for surface crack detection, several state-of-the-art object detection models were compared with the baseline YOLO11n, as reported in Table 3. The F1-score is calculated from the reported precision and recall values.
The current quantitative comparison focuses on lightweight YOLO-family detectors because they share the same bounding-box output format and can be trained under comparable settings. Transformer-based detectors and segmentation-based crack methods are valuable alternatives, but direct comparison requires consistent annotation formats, scale handling, and evaluation protocols. Therefore, they are discussed qualitatively in the literature comparison table, while future work will include controlled experiments with RT-DETR-like detectors and segmentation networks.
The proposed model achieved the highest F1-score (81.6%) among the compared YOLO-family detectors, exceeding the baseline YOLO11n (78.4%) by 3.2 percentage points. This result indicates that the improvement in recall was achieved while maintaining high precision.
To further demonstrate the advantages of the proposed algorithm, Figure 8 presents a comparison of the precision (P), recall (R), and mAP@0.5 curves between the original and proposed YOLO11n during training. Both models converged within the 300-epoch training schedule, while the proposed model consistently outperformed the baseline in terms of P, R, and mAP@0.5.
To evaluate the robustness of the proposed framework, five independent training runs with different random seeds were conducted. Table 4 reports the mean and standard deviation of precision, recall, and mAP@0.5. The proposed model consistently achieved superior performance with lower variance than the baseline YOLO11n.
Independent-sample t-tests further confirmed that the improvements in all evaluation metrics were statistically significant (p < 0.05). The boxplots in Figure 9 illustrate the distribution of results across repeated experiments, demonstrating the stability and reliability of the proposed framework.
For the five-run statistical analysis, normality was assumed approximately because all runs used identical data splits and training settings while varying only random seeds. The reported p-values indicate statistically significant improvements, and the lower standard deviations suggest improved stability. In future work, confidence intervals and effect-size statistics will be reported together with p-values to further strengthen statistical interpretation.
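A two-sample comparison of this kind, together with an effect size, can be sketched with the Python standard library. The sketch uses Welch's unequal-variance form of the t statistic, which is a choice made here; the text does not state whether the pooled or the unequal-variance form was used. The per-run values in the usage lines are placeholders and are not results from this study.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Effect size using the pooled sample standard deviation."""
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (
        len(a) + len(b) - 2
    )
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


# Placeholder per-run scores for two models (NOT values from this study).
model_a = [2.0, 3.0, 4.0, 5.0, 6.0]
model_b = [1.0, 2.0, 3.0, 4.0, 5.0]
t_stat, dof = welch_t(model_a, model_b)
effect = cohens_d(model_a, model_b)
```

Converting the statistic and degrees of freedom into a p-value requires the cumulative distribution of Student's t, which is available in SciPy, for example through `scipy.stats.ttest_ind(a, b, equal_var=False)`. With only five runs per model, reporting the effect size alongside the p-value helps convey how large the difference is relative to run-to-run variation.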
4.7. Generalization Evaluation on the Crack500 Dataset
To further evaluate the generalization capability of the proposed framework, cross-dataset experiments were conducted using the publicly available Crack500 dataset [34]. The model was trained on the self-constructed mining crack dataset and directly tested on Crack500 without additional fine-tuning. Because Crack500 is a close-range pavement crack dataset rather than a high-altitude UAV mining dataset, this experiment should be interpreted as a cross-domain stress test rather than proof of full operational transferability.
As shown in Table 5, the proposed model achieved the highest precision, recall, and mAP@0.5 among all compared methods. Specifically, the proposed framework improved mAP@0.5 by 5.1 percentage points compared with the baseline YOLO11n, demonstrating superior transferability across different crack scenarios. The F1-score is calculated from the reported precision and recall values.
The scale space, background texture, and ground sampling distance differ substantially between the mining UAV dataset and Crack500. No scale alignment or domain adaptation was applied in this experiment; therefore, the reported gain mainly indicates that the proposed modules improve relative robustness compared with the baseline under the same cross-domain protocol. Broader validation across multiple mining regions and UAV flight conditions is still required before claiming general deployment capability.
On Crack500, the proposed model achieved the highest F1-score (78.4%), compared with 74.4% for YOLO11n. This 4.0-percentage-point improvement supports the cross-domain robustness of the proposed feature-extraction and localization strategy.
Figure 10 presents representative detection results on Crack500. Compared with the baseline model, the proposed framework exhibits stronger continuity perception for elongated crack structures and better resistance to background interference, particularly under shadow and texture-rich conditions.
These results indicate that the integration of SAConv, CGA, and Shape-IoU enhances feature representation robustness and improves the generalization capability of the network beyond mining-specific datasets.
4.8. Ablation Study and Comparative Analysis
To evaluate the effectiveness of the proposed improvements, systematic ablation and comparative experiments were conducted based on YOLO11n. The results show that SAConv, CGA, and Shape-IoU contribute differently to crack detection performance.
Among the three modules, SAConv provides the most significant improvement in recall and mAP@0.5. This improvement mainly results from its adaptive receptive field, which enhances multi-scale feature extraction and continuity perception for elongated and discontinuous crack structures. CGA further improves detection performance by suppressing responses to vegetation, shadows, and complex surface textures, thereby reducing background interference and strengthening crack feature representation. Although Shape-IoU contributes less to overall accuracy, it improves localization stability by incorporating shape-aware geometric constraints into bounding-box regression.
By integrating all three modules, the proposed model achieves the best overall performance. Compared with the baseline YOLO11n, precision, recall, and mAP@0.5 increase by 1.4, 4.6, and 3.2 percentage points, respectively. The improvement is particularly evident for small, slender, and low-contrast crack targets, which are commonly observed in mining subsidence areas.
Comparative experiments further demonstrate the superiority of the proposed framework. While YOLOv8n achieves competitive performance, its detection accuracy remains lower than that of the proposed model. YOLO12n exhibits the weakest overall performance. Considering detection accuracy, model complexity, and computational cost jointly, the proposed model achieves the most favorable balance between effectiveness and efficiency.
The training curves of precision, recall, and mAP@0.5 show stable convergence, with all metrics stabilizing within the 300-epoch training schedule. In addition, the Grad-CAM and feature-map visualizations presented in Figure 6 and Figure 7 indicate that the proposed model focuses more accurately on crack regions and suppresses irrelevant background responses. Representative results on the Crack500 dataset (Figure 10) and UAV imagery from the study area (Figure 11) further show fewer false positives and missed detections compared with YOLO11n, particularly for elongated and discontinuous cracks under complex environmental conditions.
Overall, the proposed framework achieves more accurate crack localization and detection while maintaining stable performance, demonstrating its applicability for UAV-based surface crack monitoring in mining areas.
5. Discussion
A deeper interpretation of the results indicates that the proposed architecture improves crack detection mainly by strengthening long-range context, suppressing background interference, and improving slender-box localization. However, these gains are accompanied by a clear computational trade-off, so the method should be considered an accuracy-oriented enhancement rather than a strictly lightweight detector.
The increase in parameters and FLOPs is acceptable for UAV-based inspection tasks where images can be processed after flight or on a workstation, but lightweight pruning, knowledge distillation, and edge-device deployment remain necessary for fully real-time field applications.
The proposed framework enhances surface crack detection in mining areas by integrating SAConv, CGA, and Shape-IoU into YOLO11n. Experimental results demonstrate improved detection performance on both the self-constructed mining crack dataset and the Crack500 dataset, indicating good adaptability to different crack scenarios.
Compared with recent UAV-based crack detection studies [16, 17, 23, 32], the proposed framework achieves higher recall and mAP@0.5, but this improvement is obtained at the cost of substantially increased computational complexity. In particular, the proposed model improves mAP@0.5 by 3.2% over the baseline YOLO11n and demonstrates stronger cross-dataset performance on Crack500. These improvements suggest that the proposed framework is more effective in handling thin, elongated, and discontinuous crack structures, but it is not a strict nano-level lightweight model after SAConv is introduced.
One notable advantage of the proposed framework is its robustness under complex environmental conditions. Mining surface images often contain vegetation, shadows, exposed rocks, and heterogeneous textures, which can interfere with crack identification. The incorporation of CGA improves feature discrimination and reduces background interference, enabling more reliable crack detection under challenging conditions. This advantage is further supported by the Grad-CAM and feature-map visualizations, which show more concentrated responses along crack regions and reduced activation in irrelevant background areas.
Failure cases were associated primarily with three mechanisms: (i) visual ambiguity, where strong shadows, vegetation boundaries, tire tracks, and erosion textures resembled crack edges and caused false positives; (ii) incomplete visibility, where dense vegetation or severe shadow interrupted crack continuity and increased false negatives for partially occluded cracks; and (iii) insufficient spatial evidence, where very narrow, highly fragmented, or low-contrast fissures approached the effective image-resolution limit and reduced detection sensitivity. These observations highlight the need for environment-stratified sampling, hard-negative mining, moderate geometry-preserving augmentation, and multi-region validation. Because the retained test records were not tagged by environmental category, category-specific error rates are not reported; this remains a target for future work.
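If test records were tagged by environmental category, category-specific error rates could be computed along the following lines. This is a minimal sketch, not part of the reported evaluation: the record layout (a `category` label plus per-image `tp`, `fp`, and `fn` counts) and the function name `per_category_rates` are hypothetical, and no such tags exist in the current dataset.

```python
from collections import defaultdict


def per_category_rates(records):
    """Aggregate precision and recall per environmental category.

    `records` is an iterable of dicts with the hypothetical keys
    'category' (e.g. 'shadow', 'vegetation'), 'tp', 'fp', and 'fn',
    where the counts come from matching detections to ground truth
    at a fixed IoU threshold for one image.
    """
    totals = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for record in records:
        bucket = totals[record["category"]]
        for key in ("tp", "fp", "fn"):
            bucket[key] += record[key]

    rates = {}
    for category, counts in totals.items():
        detections = counts["tp"] + counts["fp"]    # all predicted boxes
        ground_truth = counts["tp"] + counts["fn"]  # all annotated cracks
        rates[category] = {
            # None marks an undefined ratio (no detections or no annotations).
            "precision": counts["tp"] / detections if detections else None,
            "recall": counts["tp"] / ground_truth if ground_truth else None,
        }
    return rates
```

Reporting precision and recall separately per category would indicate whether a given environment mainly produces false positives (visual ambiguity) or false negatives (incomplete visibility or insufficient spatial evidence).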
The proposed framework also exhibits improved sensitivity to small, slender, and discontinuous cracks. Surface cracks in mining subsidence areas are typically characterized by elongated morphology, large variations in scale, and weak visual contrast. By adaptively adjusting the receptive field, SAConv enhances multi-scale feature extraction and continuity perception, allowing the network to better capture crack structures of varying sizes. Meanwhile, Shape-IoU introduces shape-aware constraints into bounding-box regression, improving localization stability for irregular crack targets. Together, these improvements contribute to the enhanced localization accuracy and detection performance observed across all experiments.
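The shape-aware regression term can be illustrated with a minimal sketch of a Shape-IoU-style loss for a single pair of boxes. It follows the commonly published form of Shape-IoU (an IoU term, a shape-weighted center-distance term, and a shape-weighted size term); the function name `shape_iou_loss`, the `(cx, cy, w, h)` box layout, and the default values of `scale` and `theta` are assumptions for illustration and may differ from the configuration used in this study.

```python
import math


def shape_iou_loss(pred, gt, scale=1.0, theta=4.0, eps=1e-7):
    """Shape-IoU-style loss for one box pair given as (cx, cy, w, h).

    `scale` and `theta` defaults are illustrative assumptions.
    """
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt
    if min(pw, ph, gw, gh) <= 0:
        raise ValueError("box widths and heights must be positive")

    # Corner coordinates of both boxes.
    px1, py1, px2, py2 = px - pw / 2, py - ph / 2, px + pw / 2, py + ph / 2
    gx1, gy1, gx2, gy2 = gx - gw / 2, gy - gh / 2, gx + gw / 2, gy + gh / 2

    # Plain IoU.
    inter_w = max(0.0, min(px2, gx2) - max(px1, gx1))
    inter_h = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = inter_w * inter_h
    iou = inter / (pw * ph + gw * gh - inter + eps)

    # Shape weights derived from the ground-truth aspect ratio.
    ww = 2 * gw ** scale / (gw ** scale + gh ** scale)
    hh = 2 * gh ** scale / (gw ** scale + gh ** scale)

    # Center distance, normalized by the squared diagonal of the enclosing box.
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps
    distance = hh * (px - gx) ** 2 / c2 + ww * (py - gy) ** 2 / c2

    # Size mismatch, weighted by the same shape factors.
    omega_w = hh * abs(pw - gw) / max(pw, gw)
    omega_h = ww * abs(ph - gh) / max(ph, gh)
    shape = (1 - math.exp(-omega_w)) ** theta + (1 - math.exp(-omega_h)) ** theta

    return 1 - iou + distance + 0.5 * shape
```

For a slender horizontal ground-truth box such as `(0, 0, 100, 4)`, `ww` is much larger than `hh`, so an offset across the short side of the box is weighted more heavily in the distance term than an equal offset along its length.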
Despite these advantages, several limitations remain. Although the proposed model achieves higher detection accuracy, the introduction of SAConv substantially increases computational complexity, resulting in a 55.1% reduction in inference speed compared with the baseline YOLO11n. This trade-off may limit its deployment on resource-constrained edge devices or in real-time monitoring systems. Furthermore, the dataset used in this study was collected from a single mining region, and its geological and environmental diversity remains limited. Therefore, the model’s generalization capability under extreme conditions, such as dense vegetation cover, severe ground deformation, or highly complex terrain backgrounds, requires further validation.
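To make the practical meaning of this speed reduction concrete, the short sketch below converts a relative drop in throughput into the corresponding increase in per-image latency. It assumes that "inference speed" refers to throughput (frames per second), so that a 55.1% reduction leaves 44.9% of the baseline frame rate; the function name `latency_multiplier` is illustrative.

```python
def latency_multiplier(speed_reduction):
    """Return the factor by which per-image latency grows.

    `speed_reduction` is the fractional drop in throughput (FPS)
    relative to the baseline, e.g. 0.551 for a 55.1% reduction.
    """
    if not 0.0 <= speed_reduction < 1.0:
        raise ValueError("speed_reduction must lie in [0, 1)")
    remaining_throughput = 1.0 - speed_reduction
    # Latency is the reciprocal of throughput, so it scales inversely.
    return 1.0 / remaining_throughput


# Under the throughput assumption above, each image takes about 2.23 times
# as long to process as with the baseline.
factor = latency_multiplier(0.551)
```

Under this assumption, a batch of images that the baseline processes in a given time would take a little more than twice as long with the proposed model, which is acceptable for post-flight processing but restrictive for on-board inference.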
Because all UAV images were collected from the Zhungeer mining area, the current dataset may reflect regional geological and environmental characteristics. Crack morphology in central-eastern mining areas, the Shendong mining area, Xinjiang mining areas, and southwestern mining areas may differ in soil color, vegetation coverage, fracture scale, and deformation pattern. Therefore, the current results should not be interpreted as fully representative of all mining regions; multi-region datasets are required to verify broader generalization.
Future work will focus on reducing computational overhead while maintaining detection accuracy. Lightweight network design, model compression, and knowledge distillation techniques may be explored to improve deployment efficiency. In addition, the integration of multi-source data, such as UAV imagery, LiDAR, and multispectral information, may further enhance the robustness and generalization capability of crack detection in complex mining environments.