Rock Surface Crack Recognition Based on Improved Mask R-CNN with CBAM and BiFPN

Hu, Yu; Deng, Naifu; Ye, Fan; Zhang, Qinglong; Yan, Yuchen

doi:10.3390/buildings15193516


by Yu Hu ¹,², Naifu Deng ², Fan Ye ³, Qinglong Zhang ³ and Yuchen Yan ³,*

¹ State Key Laboratory of Hydroscience and Engineering, Tsinghua University, Beijing 100084, China
² Department of Hydraulic Engineering, Tsinghua University, Beijing 100084, China
³ School of Future Cities, University of Science and Technology Beijing, Beijing 100083, China
* Author to whom correspondence should be addressed.

Buildings 2025, 15(19), 3516; https://doi.org/10.3390/buildings15193516

Submission received: 23 August 2025 / Revised: 16 September 2025 / Accepted: 26 September 2025 / Published: 29 September 2025

(This article belongs to the Special Issue Recent Scientific Developments in Structural Damage Identification)


Abstract

To address the challenges of multi-scale distribution, low contrast, and background interference in rock crack identification, this paper proposes an improved Mask R-CNN model (CBAM-BiFPN-Mask R-CNN) that integrates the convolutional block attention module (CBAM) and the bidirectional feature pyramid network (BiFPN). A dataset of 1028 rock surface crack images was constructed, and model robustness was improved through a data augmentation strategy that dynamically combines Gaussian blurring, noise overlay, and color adjustment. The model embeds a CBAM module after each residual block of the ResNet50 backbone network, using channel attention to strengthen crack-related feature responses and spatial attention to focus on the spatial distribution of cracks. It also replaces the traditional FPN with BiFPN, which fuses cross-scale features adaptively through learnable weights and improves multi-scale crack feature extraction. Experimental results show that the improved model significantly improves crack recognition in complex rock mass scenarios: mAP, precision, and recall improve by 8.36%, 9.1%, and 12.7%, respectively, compared with the baseline model. This research provides an effective solution for rock crack detection in complex geological environments, particularly for reducing missed detections of small cracks and for handling complex backgrounds.

Keywords: rock fracture recognition; Mask R-CNN; CBAM; BiFPN

1. Introduction

Rock masses are characterized by a network of fractures formed by structural planes such as faults and joints, which are discontinuous and heterogeneous. If the structure is damaged, the expansion of fractures can lead to engineering problems such as instability of the surrounding rock mass. Therefore, efficient fracture identification is particularly important. Traditional fracture identification methods mainly rely on manual visual inspection, contact measurement tools (such as feeler gauges and calipers), and simple marker tracking methods [1]. Although these methods are intuitive and low-cost, they generally have significant shortcomings such as low efficiency, strong subjectivity, difficulty in accurate quantification (especially for fine or complex cracks), and inability to obtain three-dimensional information and internal defects [2]. To overcome these limitations, more advanced non-contact detection technologies have been developed, such as 3D laser scanning and infrared thermal imaging. Three-dimensional laser scanning can obtain high-precision surface point cloud data [3], but its equipment is expensive and data processing is complex [4]. Infrared thermal imaging detects near-surface anomalies through thermal radiation characteristics, but is significantly affected by ambient temperature [5,6]. The above technologies have improved the detection efficiency and objectivity to a certain extent, but they still have their own limitations in terms of cost, applicability or information dimension. In this context, digital image recognition technology, especially the method combining computer vision and deep learning, is becoming a promising research and application direction in the field of structural damage assessment because it can process massive image data in a non-contact, automated and efficient manner and realize intelligent crack positioning, segmentation and quantitative analysis [7].

Early research on intelligent identification of rock fractures was mainly based on traditional digital image recognition technology. The core idea of these methods is to separate rock cracks from the background (intact rock mass) based on differences in image grayscale, texture, geometry, or edge features, and to extract crack features as accurately and completely as possible. The mainstream traditional methods include threshold segmentation [8], edge detection [9], and region detection [10]. The threshold segmentation method binarizes the rock mass image and separates the fracture area by setting a global or local grayscale threshold. Deb et al. [11] used digital images from photogrammetry to analyze the geometric morphology of exposed rock surface structures and combined this with the Hough transform algorithm to improve results on noisy rock mass images, achieving fully automatic extraction of crack information. Reid et al. [12] applied digital image processing technology to propose a semi-automatic method for detecting rock surface traces in grayscale digital images. However, rock surfaces often suffer from uneven lighting, shadows, stains, and other interference, so segmentation with a single threshold tends to perform poorly and can easily produce broken or merged cracks, reducing the accuracy of crack identification. Edge detection methods preprocess crack images with mathematical morphology, apply operators such as Canny and Sobel together with morphological methods to obtain a rough segmentation, and then refine the roughly segmented image with a region-growing algorithm. Bolkas et al. [13] evaluated and compared different edge detection algorithms in the spatial and frequency domains and achieved automatic extraction of exposed rock cracks. Li et al. [14] used normal tensor voting theory to detect the feature points of discontinuity traces and proposed an automatic mapping method for discontinuity traces based on a three-dimensional surface model of the rock mass. Liu et al. [15] used the least squares support vector machine (LS-SVM) to estimate the grayscale value of each pixel neighborhood in the CT image and proposed an edge detection method based on the combination of gradient and zero-crossing results. In summary, traditional rock image recognition methods such as threshold segmentation and edge detection can broadly achieve the segmentation and extraction of rock joints and fissures, but they still struggle with uneven exposure, complex backgrounds, and the diversity of crack morphology, so their recognition accuracy and generalization ability are limited.

Deep learning offers strong recognition accuracy, generalization ability, and robustness, so it has been widely adopted by researchers, and deep learning-based image recognition and segmentation methods have emerged accordingly. Zhang et al. [16] enhanced the U-Net architecture by incorporating dilated convolutions to improve the identification of fractures in steep rock joint slopes and enable automated extraction of crack surface geometric parameters. The proposed model achieves higher recognition accuracy than the conventional U-Net network. Li et al. [17] used the CNN-based U-Net and DeepLabv3+ models to describe the crack traces of hard rock pillars in underground space, combined with chain code technology and a broken line approximation algorithm to identify and quantify crack length, inclination, and other characteristics. Hanat et al. [18] improved the FCN network based on the attention mechanism and achieved the recognition of complex cracks on concrete surfaces. Cha et al. [19] combined the region growing method with a convolutional neural network to identify cracks, but could only identify image sub-blocks containing cracks and could not capture the exact location of the cracks. Park et al. [20] iteratively optimized the crack segmentation model based on deep reinforcement learning and refined the pixel-level prediction results through the RL agent, making the crack boundaries more coherent and reducing missed detections. Chen et al. [21] proposed a method for identifying crack images using an NB-CNN network that combined a convolutional neural network with a naive Bayesian data fusion scheme. Ji et al. [22] proposed an automatic rock fracture identification method based on the YOLACT++ deep learning model, providing a novel solution for efficient and reliable intelligent interpretation of rock mass fractures. Chen et al. [23] proposed an improved YOLOv5 model by introducing a Ghost module into the backbone network and incorporating a coordinate attention mechanism, jointly optimizing crack detection speed and accuracy and enabling effective real-time localization of fractures. Zhang et al. [24] proposed a method that integrates a clustering algorithm with a region-growing algorithm (RGA) to extract and match feature points located on rock discontinuities. Li et al. [25] proposed a deep learning-based complex rock fracture segmentation network (CRFSSegNet), which integrates a multidimensional feature calculation method to achieve end-to-end processing, from image segmentation of multi-source datasets containing both natural and blast-induced fractures to geometric parameter extraction. Zhang et al. [26] improved the loss function of the U-Net network and proposed a crack segmentation model called CrackUnet. The above research shows that deep learning is overcoming the limitations of traditional image processing and improving the accuracy, robustness, and applicability of crack identification.

Data augmentation plays a significant role in expanding datasets. It enhances the diversity and richness of training samples by transforming and processing original images, thereby improving the performance and robustness of deep learning models. In practice, acquiring training data for some tasks is challenging, and different data augmentation strategies vary in their effectiveness for enhancing model performance. Existing image data augmentation methods can be divided into two categories. The first is traditional image data augmentation, such as random flipping and random cropping of images. For example, Lopes et al. [27] proposed a method that first selects a square region in the target image and then adds a square Gaussian noise patch to generate new images. By adjusting the patch size and the maximum standard deviation of the sampled noise, interpolation between Gaussian noise and Cutout can be achieved. Gong et al. [28] introduced KeepAugment to improve the fidelity of generated images. This method measures the importance of each region in an image and uses this as a basis for partitioning the image. During augmentation, operations on the important regions are avoided to preserve critical information. The second category is deep learning-based image data augmentation. For instance, Cubuk et al. [29] proposed AutoAugment (AA) for automatically searching improved data augmentation strategies and designed a search space for the algorithm. Each strategy consists of multiple sub-strategies, and for each image in every mini-batch, a sub-strategy is randomly selected. Huang et al. [30] proposed an Illumination-Aware Two-stage Network (IATN) for low-light image enhancement, which reduces color distortion and suppresses noise in the results, thereby achieving refined enhancement outcomes.

Based on the above, although existing studies have achieved certain results in image processing and deep learning crack recognition, facing the challenges of multi-scale, low-contrast, and complex background of rock cracks, recognition accuracy and model generalization ability are still problems that need to be solved urgently. Therefore, based on the Mask R-CNN framework, this paper integrates the CBAM attention mechanism and the BiFPN bidirectional feature fusion module to construct an improved crack recognition model CBAM-BiFPN-Mask R-CNN network with both adaptability and robustness. The recognition performance of this constructed model in complex rock mass scenarios is verified through measured image data, providing an efficient and intelligent solution for rock structure crack identification and safety assessment.

2. Materials and Methods

2.1. Dataset Construction

The image dataset constructed in this study is designed to address the highly complex task of automatic identification and precise segmentation of rock joints and fractures in natural environments. This complexity stems from the fact that fracture images are influenced by multiple factors, including rock type, geological origin, structural composition, engineering context, image acquisition methods, and sampling scales. Furthermore, the presence of multiple intersecting joint sets within a single sampling window significantly increases the spatial complexity and recognition difficulty of fracture structures.

To address these challenges, this study has constructed an image dataset that encompasses multiple working conditions, various lithologies, and multi-scale fracture characteristics. The images were collected from field surveys conducted in multiple geographic regions, including the Sanshandao Gold Mine in Shandong Province and the Ma’anshan Mining Area in Liaoning Province, covering various geological environments such as open-pit mines and rock slopes. In terms of lithology, the dataset includes different rock backgrounds such as granite and tuff. Regarding fracture morphology, it contains diverse structural patterns ranging from simple linear fractures and intersecting multi-group fractures to complex networked fracture systems, covering the major fracture patterns encountered in field practice. The dataset exhibits rich structural types and a wide scale distribution of fractures. Additionally, all images were acquired under natural lighting conditions during field collection, inherently incorporating illumination variations caused by different times of day, weather conditions, and occlusions, which helps enhance the model’s robustness in practical application scenarios. The dataset comprises a total of 257 rock surface photographs taken with mobile phones across different regions. After standardizing the image size to 1024 × 1024, the data volume was expanded fourfold through cropping, resulting in a final library of 1028 image samples, each sized 1024 × 1024 (see Figure 1a). All images were manually annotated using the open-source software LabelMe 3.16.2 to generate instance segmentation annotations (see Figure 1b). The annotation results mark the pixel locations and identification numbers of each joint and fracture, while unannotated areas of the rock surface are automatically classified as background. Corresponding instance segmentation mask images for joints and fractures were also created. The annotations are saved in JSON (.json) format.
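As a rough illustration of how such annotation files can be read, the sketch below parses the text of one LabelMe JSON file and lists its polygon instances. The top-level keys (`shapes`, `imageHeight`, `imageWidth`) and per-shape keys (`label`, `points`, `shape_type`) follow the LabelMe file layout; the label name `crack`, the file name, and the coordinates are hypothetical values chosen only for the example, and `load_instances` is an illustrative helper, not part of the authors' pipeline.

```python
import json


def load_instances(labelme_text):
    """Return (image_size, instances) parsed from the text of one LabelMe JSON file."""
    data = json.loads(labelme_text)
    size = (data["imageHeight"], data["imageWidth"])
    instances = []
    for idx, shape in enumerate(data["shapes"], start=1):
        # Only polygon shapes are treated as instance masks in this sketch.
        if shape.get("shape_type", "polygon") != "polygon":
            continue
        xs = [p[0] for p in shape["points"]]
        ys = [p[1] for p in shape["points"]]
        instances.append({
            "id": idx,                                  # instance number within the image
            "label": shape["label"],
            "bbox": (min(xs), min(ys), max(xs), max(ys)),
            "num_points": len(xs),
        })
    return size, instances


# Hypothetical annotation with two crack polygons (values are made up).
EXAMPLE = json.dumps({
    "version": "3.16.2",
    "imagePath": "sample_0001.jpg",
    "imageHeight": 1024,
    "imageWidth": 1024,
    "shapes": [
        {"label": "crack", "shape_type": "polygon",
         "points": [[10, 20], [200, 30], [195, 60], [12, 45]]},
        {"label": "crack", "shape_type": "polygon",
         "points": [[300, 400], [320, 900], [305, 905]]},
    ],
})

size, instances = load_instances(EXAMPLE)
for inst in instances:
    print(inst["id"], inst["label"], inst["bbox"])
```

Pixels outside every polygon carry no annotation, which matches the convention above of treating unannotated regions as background.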

2.2. Raw Image Preprocessing

Because rock images captured in different shooting environments suffer from uneven brightness and noise interference, a single image distribution can hardly cover all feature variations. To improve the robustness and generalization ability of the model in complex environments, this study introduces a variety of data augmentation strategies to preprocess the original images. Specifically, dynamic augmentation operations such as image blurring, noise overlay, and brightness and saturation changes are randomly combined during training, so that each round of training is exposed to fracture images under a variety of simulated conditions. This augmentation mechanism shows potential for alleviating the problem of small sample data and improving the model’s adaptability to complex real-world fractures.

High-quality and large-volume datasets are key factors in achieving high accuracy when training network models. Therefore, this study employed a variety of dynamic data augmentation methods, as shown in Table 1: Gaussian blur, median blur, mean blur, Gaussian noise, saturation, contrast, and brightness. These methods are randomly applied during each training iteration. By simulating a variety of imaging conditions (such as reflective rock surfaces under strong light and wet rock surfaces after rain), the model’s generalization ability in complex field scenes is enhanced. By randomly combining different augmentations (such as Gaussian blur and saturation), multiple image variants are generated during each training session to compensate for the small sample size of rock fracture datasets. Figure 2 shows the effects of applying a single augmentation method to an image, such as Gaussian blur, Gaussian noise, and brightness. Figure 3 shows the effects of applying a combination of augmentation methods to an image, such as brightness combined with contrast, brightness combined with saturation, Gaussian blur combined with saturation, and contrast combined with saturation. Figure 4 shows a performance comparison of the different augmentation methods. The radar chart evaluates each method on five metrics: mean square error (MSE), structural similarity (SSIM), peak signal-to-noise ratio (PSNR), edge preservation, and noise control. The polygon area reveals the relative strength of the different methods.
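The random combination of augmentations described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the parameter ranges, the per-operation probability `p`, and the function names are assumptions, and only a subset of the methods in Table 1 is included (Gaussian and median blur are omitted to keep the example dependency-free).

```python
import numpy as np


def gaussian_noise(img, rng, sigma=10.0):
    # Add zero-mean Gaussian noise (sigma is an assumed value).
    noisy = img.astype(np.float32) + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)


def mean_blur(img, rng, k=3):
    # k x k box filter built from shifted copies of an edge-padded image.
    pad = k // 2
    padded = np.pad(img.astype(np.float32), ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    h, w = img.shape[:2]
    acc = np.zeros(img.shape, dtype=np.float32)
    for dy in range(k):
        for dx in range(k):
            acc += padded[dy:dy + h, dx:dx + w]
    return np.clip(acc / (k * k), 0, 255).astype(np.uint8)


def brightness(img, rng, max_delta=40.0):
    # Shift all pixel values by a random offset.
    delta = rng.uniform(-max_delta, max_delta)
    return np.clip(img.astype(np.float32) + delta, 0, 255).astype(np.uint8)


def contrast(img, rng, low=0.7, high=1.3):
    # Scale deviations from the global mean by a random factor.
    f = img.astype(np.float32)
    mean = f.mean()
    return np.clip((f - mean) * rng.uniform(low, high) + mean, 0, 255).astype(np.uint8)


def saturation(img, rng, low=0.6, high=1.4):
    # Blend each pixel with its gray value; factor < 1 desaturates.
    f = img.astype(np.float32)
    gray = f.mean(axis=2, keepdims=True)
    return np.clip(gray + (f - gray) * rng.uniform(low, high), 0, 255).astype(np.uint8)


OPS = [gaussian_noise, mean_blur, brightness, contrast, saturation]


def random_augment(img, rng, p=0.5):
    """Apply each operation independently with probability p (assumed policy)."""
    out = img
    for op in OPS:
        if rng.random() < p:
            out = op(out, rng)
    return out


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # stand-in for a rock image
augmented = random_augment(image, rng)
```

Because the operations are photometric only, the instance masks do not need to be transformed alongside the image, which keeps this kind of augmentation simple to apply on the fly.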

2.3. CBAM-BiFPN-Mask R-CNN Network

Mask R-CNN is a classic model in the field of object detection and instance segmentation. It adds a mask prediction branch based on Faster R-CNN to achieve pixel-level accurate segmentation [31]. In response to the specific challenges posed by rock fracture images—such as complex backgrounds, significant scale variations, and irregular morphologies—this study introduces systematic improvements to the classic framework. Specifically, we propose a synergistic architecture integrating CBAM and BiFPN. On one hand, the hierarchical attention mechanism preserves feature saliency across different network depths: shallow CBAM modules enhance edge and textural details of fractures, while deeper CBAM modules focus on semantic information and global positioning, effectively preventing the loss of small fracture features in deep layers. On the other hand, BiFPN introduces learnable weighting parameters to adaptively adjust the contribution of multi-scale features during fusion. This weighted bidirectional fusion strategy is more rational than the equal-weight fusion used in traditional FPN, leading to more efficient multi-scale feature integration.

As shown in Figure 5, the improved fracture segmentation framework adopts a four-stage cascaded structure: First, ResNet50 is used as the backbone network, with Convolutional Block Attention Module (CBAM) embedded after each residual block to form a hierarchical attention mechanism—shallow layers enhance fracture edge and texture features, while deeper layers strengthen semantic information and global localization capability. Second, the traditional Feature Pyramid Network (FPN) is replaced with Bidirectional Feature Pyramid Network (BiFPN), which achieves adaptive multi-scale feature fusion through learnable weights and cross-scale bidirectional connections. The Region Proposal Network (RPN) employs sliding windows and anchor mechanisms to generate high-quality candidate regions. Subsequently, RoIAlign eliminates the quantization errors of traditional RoIPooling via bilinear interpolation, precisely mapping variable-sized candidate regions to fixed-resolution feature maps. Finally, the model employs multi-task learning to simultaneously output object categories, refined bounding box coordinates, and high-precision binary masks. This end-to-end architecture synergistically optimizes attention mechanisms and feature pyramid integration, effectively combining low-level detailed features with high-level semantic information, thereby significantly enhancing the localization and segmentation performance for complex-shaped targets such as fractures [31].
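To make the difference between RoIPooling and RoIAlign concrete, the sketch below samples a tiny feature map at a fractional coordinate in two ways: by snapping to the grid, and by bilinear interpolation. It is a simplified, pure-Python illustration; the function names are hypothetical, and a real RoIAlign layer averages several such sampled points inside each output bin.

```python
import math


def bilinear_sample(feature, y, x):
    """Sample a 2D feature map (list of rows) at a fractional (y, x) location."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1)
    x = min(max(x, 0.0), w - 1)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def quantized_sample(feature, y, x):
    """RoIPooling-style lookup: the coordinate is truncated to the grid first."""
    return feature[int(y)][int(x)]


feature_map = [[0.0, 1.0],
               [2.0, 3.0]]
aligned = bilinear_sample(feature_map, 0.5, 0.5)   # blends all four neighbours
snapped = quantized_sample(feature_map, 0.5, 0.5)  # falls back to the top-left cell
print(aligned, snapped)
```

For thin structures such as cracks, a half-pixel shift on a coarse feature map can correspond to many pixels in the input image, which is why avoiding the truncation step matters for mask quality.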

2.3.1. CBAM Attention Mechanism

In deep learning, the attention mechanism imitates the selective attention of human vision, enabling the neural network to dynamically focus on key areas of the input data and thereby improving the model’s ability to perceive important features. In this study, the Convolutional Block Attention Module (CBAM) [32] is embedded into the ResNet50 backbone network to address the particularities of rock fracture images (such as complex background noise, multi-scale fracture distribution, and weak edge features). CBAM is a lightweight, plug-and-play hybrid attention module that can be inserted as an independent block without modifying the main network structure. It adds only a small amount of computational overhead while improving the Intersection over Union (IoU) of fracture segmentation (based on tests on the self-built rock mass dataset). Through the combined effect of channel attention and spatial attention, it enhances fracture features and suppresses interference. Channel attention aggregates the channel information of the feature map through global average pooling and global maximum pooling to generate a channel weight vector, which enables the network to adaptively enhance the feature channels related to cracks (such as high-frequency edge responses) while suppressing irrelevant channels (such as the uniform texture of the rock mass). Building on channel attention, spatial attention generates a spatial weight map through a convolution operation, guiding the model to focus on the local areas where cracks are distributed (such as linear cracks and mesh crack groups) and reducing the interference of background noise (such as shadows and surface stains).

In the rock crack identification task, the attention weights are learned end to end, so the model automatically emphasizes key morphological features of cracks, such as strike continuity and width changes, while reducing its response to non-crack areas (such as rock particles and water stains). Combined with the feature pyramid structure, CBAM dynamically adjusts the attention weights on feature maps at different levels so that the model captures both large-scale crack zones (macroscale) and the detailed features of microcracks (microscale).

The channel attention mechanism learns the differences in importance between channels and adaptively recalibrates channel features accordingly, automatically enhancing task-related channels and suppressing noise. Hu et al. [33] significantly improved feature expression capability by introducing the SE module. The channel attention module in CBAM adopts a similar strategy and has been shown to improve performance when embedded in networks such as Mask R-CNN [32]. In the improved Mask R-CNN, each convolution kernel is responsible for extracting a specific feature pattern of the input data (such as texture, edge, or color), and its output constitutes one channel of the feature map. However, not all channels are equally important for a specific task (such as image classification or object detection); some channels may contain discriminative information that is more relevant to the current task, while others may contain redundant information or noise. The channel attention mechanism aims to automatically identify and enhance the information-rich feature channels while suppressing channels that carry little or no useful information.

As shown in Figure 6, the input feature map F first undergoes global average pooling and global maximum pooling, yielding two 1 × 1 × C feature maps. Each is passed through a shared multilayer perceptron (MLP) consisting of two fully connected layers with a ReLU activation function in between. The two MLP outputs are added, and the channel attention map M_c ∈ R^{C × 1 × 1} is obtained through the Sigmoid activation function. M_c is then multiplied element-wise with the input feature map F to obtain the weighted feature map F′. The calculation is given in Formulae (1) and (2):

M_{c} = \sigma \{MLP [AvgPool (F)] + MLP [MaxPool (F)]\}

(1)

F^{'} = M_{c} * F

(2)

In the formula, σ represents the Sigmoid function, AvgPool represents the global average pooling, and MaxPool represents the global maximum pooling.

The goal of the spatial attention mechanism is to learn and model the importance differences in different spatial locations in the feature map and adaptively weight the spatial regions of the feature map accordingly. Unlike channel attention, which focuses on important feature channels, spatial attention focuses on important locations in the feature map. This is particularly important for tasks such as object detection (focusing on target regions), semantic segmentation (focusing on foreground objects), or image description (focusing on salient regions in an image), since the distribution of relevant information is usually spatially non-uniform [32,34].

The spatial attention module takes as input the feature map F′ produced by the channel attention module and performs maximum pooling and average pooling on F′ along the channel dimension, yielding two H × W × 1 feature maps. The two maps are concatenated along the channel dimension to obtain a new feature map F_cat ∈ R^{H × W × 2}. A 7 × 7 convolution (usually with appropriate padding to keep the spatial size unchanged) is applied to F_cat, and the result is passed through the Sigmoid activation function to generate the spatial attention map M_s ∈ R^{H × W × 1}. M_s is then multiplied element-wise with F′ to obtain the final output feature map F″. The calculation formula is shown as follows:

P_{1} = MaxPool (F^{'})

(3)

P_{2} = AvgPool (F^{'})

(4)

M_{s} = \sigma \{f^{7 \times 7} ([P_{1}; P_{2}])\}

(5)

F^{''} = M_{s} * F^{'}

(6)

where * indicates the element-by-element multiplication of two matrices, [P_1; P_2] indicates concatenation along the channel dimension, and f^{7 × 7} indicates a convolution with a 7 × 7 kernel.
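The channel and spatial attention steps described in this subsection can be sketched as a small PyTorch module. This is an illustrative reading of the formulas rather than the authors' code: the reduction ratio of 16 in the shared MLP is the default from the original CBAM paper and is an assumption here, the shared fully connected layers are written as 1 × 1 convolutions for convenience, and the class names are hypothetical.

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """M_c = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))."""

    def __init__(self, channels, reduction=16):  # reduction ratio is an assumption
        super().__init__()
        hidden = max(channels // reduction, 1)
        # Shared MLP: two fully connected layers expressed as 1x1 convolutions.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.ReLU(),
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
        )

    def forward(self, f):
        avg = self.mlp(torch.mean(f, dim=(2, 3), keepdim=True))  # global average pooling
        mx = self.mlp(torch.amax(f, dim=(2, 3), keepdim=True))   # global maximum pooling
        return torch.sigmoid(avg + mx)                           # shape (B, C, 1, 1)


class SpatialAttention(nn.Module):
    """M_s = sigmoid(conv7x7([MaxPool(F'); AvgPool(F')])) along the channel axis."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, f):
        p1 = torch.amax(f, dim=1, keepdim=True)   # channel-wise maximum pooling
        p2 = torch.mean(f, dim=1, keepdim=True)   # channel-wise average pooling
        return torch.sigmoid(self.conv(torch.cat([p1, p2], dim=1)))  # shape (B, 1, H, W)


class CBAM(nn.Module):
    """Channel attention followed by spatial attention."""

    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.channel = ChannelAttention(channels, reduction)
        self.spatial = SpatialAttention(kernel_size)

    def forward(self, f):
        f1 = self.channel(f) * f        # F'  = M_c * F
        return self.spatial(f1) * f1    # F'' = M_s * F'


torch.manual_seed(0)
x = torch.randn(2, 64, 16, 16)          # dummy feature map (batch, channels, H, W)
cbam = CBAM(64)
y = cbam(x)
```

Because the output has the same shape as the input, a block like this can be placed after a residual stage without changing the layers that follow it.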

2.3.2. Weighted Bidirectional Feature Pyramid Network

In the field of deep learning, multi-scale feature fusion has always been a key technology for improving the performance of object detection and instance segmentation. The FPN [35] transfers high-level semantic features to low-level layers through a top-down path. Although it achieves the fusion of multi-scale features to a certain extent, its one-way information flow limits the complementary role of low-level detailed features to high-level features [36]. The subsequent Path Aggregation Network (PANet) improved this limitation by adding bottom-up paths [37], but its feature fusion process still lacks the ability to adaptively evaluate the importance of features at different scales. These limitations are particularly evident in complex tasks such as rock crack detection, because rock cracks often have irregular shapes, multi-scale distributions, and low contrast. Traditional feature fusion methods are difficult to fully capture these complex features.

To address these issues, this study introduced a weighted bidirectional feature pyramid network (BiFPN) into the Mask R-CNN framework. This network constructs a bidirectional feature flow path to enable deep interaction and fusion between high-level and low-level features [38]. Specifically, high-level semantic features guide the semantic enhancement of low-level features through upsampling operations, while low-level detail features optimize the positioning accuracy of high-level features through downsampling operations. This is because the top-down path helps to transmit high-level semantic information to the lower layers, enhancing the understanding of the coherence of the fracture lines; at the same time, the bottom-up path can preserve the fine geometric features of microcracks, preventing these important information from disappearing in the deep network. Therefore, this two-way interaction mechanism is suitable for rock fracture detection tasks. Furthermore, BiFPN also introduces learnable weight parameters. This dynamic weighting strategy can automatically evaluate the importance of features at different levels, prioritize high-confidence features in the crack intersection area, and effectively suppress interference from background areas. The specific BiFPN structure is shown in Figure 7 below [32].

P^{out} = f (P^{in})

(7)

P_{3}^{td} = Conv \left(\frac{ω_{1} \cdot P_{3}^{in} + ω_{2} \cdot Resize (P_{4}^{in})}{ω_{1} + ω_{2} + ε}\right)

(8)

P_{3}^{out} = Conv \left(\frac{ω_{1}^{'} \cdot P_{3}^{in} + ω_{2}^{'} \cdot P_{3}^{td} + ω_{3}^{'} \cdot Resize (P_{2}^{out})}{ω_{1}^{'} + ω_{2}^{'} + ω_{3}^{'} + ε}\right)

(9)

In the formula, P_{3}^{td} and P_{3}^{out} are the intermediate features and output features of the third level, respectively; ω_{1} and ω_{2} are the learnable weights for the third-level feature fusion, indicating the importance of the third-level input features and the adjusted fourth-level input features, respectively; ω_{1}^{'}, ω_{2}^{'}, and ω_{3}^{'} are the learnable weights for the third-level output feature fusion, indicating the importance of the third-level input features, the third-level intermediate features, and the adjusted second-level output features, respectively; P_{3}^{in} and P_{4}^{in} represent the input features of the third and fourth levels, respectively; P_{2}^{out} is the output feature of the second level; Resize is the upsampling or downsampling operation; Conv is the separable convolution; and ε = 0.0001.

To sum up, BiFPN introduces learnable weights to measure the importance of different input features and utilizes top-down and bottom-up multi-scale feature fusion to further improve the detection accuracy and robustness of complex cracks.
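The weighted fusion in Formulae (8) and (9) can be sketched numerically as follows. This NumPy example is a simplified illustration under stated assumptions: the separable convolution is omitted (treated as identity), `Resize` is approximated by nearest-neighbour indexing, and the weights are clipped at zero before normalization, as in the fast normalized fusion of the original BiFPN design. The function names and the toy feature values are hypothetical.

```python
import numpy as np

EPS = 1e-4  # the small constant ε from the formulas


def fast_normalized_fusion(features, weights, eps=EPS):
    """Weighted average of same-shaped feature maps with non-negative weights."""
    w = np.maximum(np.asarray(weights, dtype=np.float64), 0.0)  # keep weights >= 0
    fused = sum(wi * f for wi, f in zip(w, features))
    return fused / (w.sum() + eps)


def resize_nearest(x, out_hw):
    """Nearest-neighbour up- or downsampling of an (H, W, C) array."""
    h, w = x.shape[:2]
    rows = np.arange(out_hw[0]) * h // out_hw[0]
    cols = np.arange(out_hw[1]) * w // out_hw[1]
    return x[rows][:, cols]


# Toy inputs: level 3 at 8x8, level 4 at 4x4, level 2 output at 16x16.
p3_in = np.ones((8, 8, 4))
p4_in = np.full((4, 4, 4), 3.0)
p2_out = np.zeros((16, 16, 4))

# Formula (8): top-down intermediate feature (Conv omitted in this sketch).
p3_td = fast_normalized_fusion([p3_in, resize_nearest(p4_in, (8, 8))], [1.0, 1.0])

# Formula (9): output feature combining input, intermediate, and the level-2 output.
p3_out = fast_normalized_fusion(
    [p3_in, p3_td, resize_nearest(p2_out, (8, 8))], [1.0, 1.0, 1.0]
)
```

In a trained network the weights are parameters updated by backpropagation, so a level that carries little useful information for a given fusion node can be driven toward zero and effectively ignored.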

3. Experiment

3.1. Experimental Environment

During training, the rock fracture identification model was built under the Windows system using the deep learning framework PyTorch 1.10.1 with CUDA 11.3 and cuDNN 8.2.0. The entire algorithm was implemented in Python 3.10. The hardware information of this experiment is detailed in Table 2.

3.2. Evaluation Indicators

This study uses precision (P), recall (R), the F1 score, and mean average precision (mAP) at an IoU threshold of 0.5 as evaluation indicators to measure the performance of the model. The calculation formulas for each indicator are

P = \frac{T_{P}}{T_{P} + F_{P}}

(10)

R = \frac{T_{P}}{T_{P} + F_{N}}

(11)

m_{AP} = \frac{\sum_{i = 1}^{N} A_{P, i}}{N}

(12)

F_{1} = \frac{2 T_{P}}{2 T_{P} + F_{P} + F_{N}}

(13)

where T_{P} is the number of positive samples correctly predicted as positive, F_{P} is the number of negative samples incorrectly predicted as positive, F_{N} is the number of positive samples missed by the model, A_{P} is the average precision of a single category, m_{AP} is the mean average precision, and N is the number of categories of detection features.
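The indicators in Formulae (10) to (13) can be computed directly from detection counts. The sketch below is a plain-Python illustration; the counts and the per-category AP value are hypothetical numbers used only to exercise the functions, not results from this study.

```python
def precision(tp, fp):
    """P = TP / (TP + FP); returns 0.0 when there are no predictions."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """R = TP / (TP + FN); returns 0.0 when there are no ground-truth positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN), the harmonic mean of P and R."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mean_average_precision(ap_per_category):
    """mAP = mean of the per-category average precision values."""
    return sum(ap_per_category) / len(ap_per_category)


# Hypothetical counts at IoU >= 0.5: 80 matched cracks, 20 false alarms, 10 misses.
tp, fp, fn = 80, 20, 10
p = precision(tp, fp)
r = recall(tp, fn)
f1 = f1_score(tp, fp, fn)
m_ap = mean_average_precision([0.75])  # single "crack" category, so mAP equals its AP
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f} mAP={m_ap:.3f}")
```

With a single foreground category, as in a crack-versus-background task, N = 1 and mAP reduces to the AP of that category.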

3.3. Experimental Setup

This study designed a systematic comparative experiment based on the Mask R-CNN framework, verifying the impact of different modules on rock fracture detection performance through a progressive improvement strategy. The experiments used a unified benchmark setup to ensure comparability of the results. All models used ResNet50 as the backbone network and were initialized with pre-trained ResNet50 weights. The image data are divided into training, validation, and test sets at a ratio of 7:2:1. Input images are scaled to fit within 1024 × 1024 pixels while maintaining the original aspect ratio, and the remaining area is padded with zeros. Training was performed using the SGD optimizer with an initial learning rate of 0.01, a cosine annealing learning rate schedule, a momentum parameter of 0.9, and a weight decay coefficient of 5 × 10⁻⁴. Each training batch contained eight images, and training was completed on a single NVIDIA V100 GPU for 100 epochs. Early stopping based on the validation set mean average precision (mAP) was used to prevent overfitting.
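Two pieces of this setup are easy to make concrete: the 7:2:1 split and the cosine annealing schedule. The sketch below is illustrative only. How the authors rounded the split is not stated, so the integer rounding here (remainder assigned to the test set) is an assumption, as are a minimum learning rate of 0 and per-epoch (rather than per-iteration) annealing.

```python
import math


def split_counts(n_images, ratios=(7, 2, 1)):
    """Integer train/val/test sizes; any remainder goes to the test set (assumed)."""
    total = sum(ratios)
    n_train = n_images * ratios[0] // total
    n_val = n_images * ratios[1] // total
    return n_train, n_val, n_images - n_train - n_val


def cosine_annealing_lr(epoch, base_lr=0.01, total_epochs=100, min_lr=0.0):
    """lr(t) = min_lr + 0.5 * (base_lr - min_lr) * (1 + cos(pi * t / T))."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * epoch / total_epochs))


counts = split_counts(1028)
schedule = [cosine_annealing_lr(e) for e in (0, 20, 60, 100)]
print(counts)
print([round(lr, 5) for lr in schedule])
```

Under these assumptions the learning rate decreases smoothly from 0.01 toward 0 over the 100 epochs, with the steepest decline around the midpoint of training.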

To enhance the model’s generalization capabilities, various data augmentation methods were employed during training. Regarding model improvement, this study designed four progressive experimental schemes: (i) the basic Mask R-CNN model as a baseline; (ii) the baseline with the CBAM attention module integrated; (iii) the baseline with the original FPN replaced by BiFPN; and (iv) a composite improvement scheme that incorporates both CBAM and BiFPN. All models kept the same training settings and hyperparameters, so that performance differences can be attributed to changes in network structure. Standard COCO evaluation metrics were used during the evaluation phase, focusing on precision, recall, and mAP, to analyze the effectiveness of each improvement strategy.
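One way to keep such an ablation honest is to hold the training settings in a single shared object and let the variants differ only in two structural flags. The sketch below illustrates that idea with the hyperparameters stated in this section; the class names, field names, and the `build_runs` helper are hypothetical and do not come from the authors' code.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Settings shared by every variant (values as stated in Section 3.3)."""
    backbone: str = "resnet50"
    image_size: int = 1024
    optimizer: str = "sgd"
    initial_lr: float = 0.01
    lr_schedule: str = "cosine_annealing"
    momentum: float = 0.9
    weight_decay: float = 5e-4
    batch_size: int = 8
    epochs: int = 100


@dataclass(frozen=True)
class Variant:
    """A model variant described only by its structural switches."""
    name: str
    use_cbam: bool
    use_bifpn: bool


SHARED = TrainConfig()
VARIANTS = [
    Variant("Mask R-CNN", use_cbam=False, use_bifpn=False),
    Variant("Mask R-CNN-CBAM", use_cbam=True, use_bifpn=False),
    Variant("Mask R-CNN-BiFPN", use_cbam=False, use_bifpn=True),
    Variant("Mask R-CNN-CBAM-BiFPN", use_cbam=True, use_bifpn=True),
]


def build_runs():
    """Pair every variant with the same training configuration."""
    return [{"variant": asdict(v), "train": asdict(SHARED)} for v in VARIANTS]


for run in build_runs():
    print(run["variant"]["name"], run["train"]["initial_lr"])
```

Because `TrainConfig` is frozen and instantiated once, a variant cannot silently train with different hyperparameters.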

4. Result Analysis

4.1. Model Training Results Analysis

Figure 8 compares the loss function convergence curves and learning rate curves of the improved Mask R-CNN model based on CBAM-BiFPN and the baseline model during training. Experimental results show that all models exhibit a stable exponential decay trend within 100 epochs, with the training loss ultimately stabilizing in the range of 0.20–0.25. In the initial phase (epochs 0–20), the loss value rapidly decreases by approximately 85%, corresponding to the learning rate remaining at its initial value (0.01), allowing the model to learn coarse-grained features. In the mid-stage (epochs 20–60), the slope of the loss decreases, coinciding with the adjustment of the cosine annealing learning rate, validating the effectiveness of the dynamic learning rate in escaping local optima. In the later stages (epochs 60–100), the loss curve flattens out, ultimately fluctuating within ±0.02, indicating that the model has reached a stable convergence state.
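The three phases described above can be related to the shape of a cosine annealing schedule. The sketch below evaluates the standard formula lr(t) = lr_min + 0.5 (lr_0 - lr_min)(1 + cos(π t / T)) with lr_0 = 0.01 and T = 100 epochs. A minimum learning rate of 0, per-epoch updates, and the absence of warm-up or restarts are assumptions, since the source does not specify them.

```python
import math


def cosine_lr(epoch: int, lr0: float = 0.01, total_epochs: int = 100,
              lr_min: float = 0.0) -> float:
    """Cosine-annealed learning rate at a given epoch (no warm-up, no restarts)."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr0 - lr_min) * cos_term


for epoch in (0, 20, 60, 100):
    lr = cosine_lr(epoch)
    print(f"epoch {epoch:3d}: lr = {lr:.6f} ({100 * lr / 0.01:.1f}% of initial)")
```

Under these assumptions the learning rate at epoch 20 is still about 90% of its initial value, drops to roughly 35% by epoch 60, and approaches zero at epoch 100, which is consistent with fast early progress followed by a flattening loss curve.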

During model training, all four models stabilized after the 40th epoch, demonstrating good convergence. In Figure 9, the baseline Mask R-CNN model achieved a mAP@0.5 value of 74.09%, serving as the reference for subsequent improvements. After introducing the CBAM attention mechanism, the Mask R-CNN-CBAM model reached 75.72%, a 1.63 percentage point improvement over the baseline model (a relative gain of about 2.20%). Mask R-CNN-BiFPN, which utilizes the BiFPN feature pyramid network, achieved a mAP@0.5 value of 78.63%, a 4.54 percentage point improvement over the baseline model. The improved Mask R-CNN-CBAM-BiFPN model, which combines CBAM and BiFPN, achieved a mAP@0.5 value of 82.45%, an 8.36 percentage point improvement over the original Mask R-CNN. The combined gain exceeds the sum of the two individual gains (1.63 + 4.54 = 6.17 percentage points), which indicates a synergistic effect between the attention mechanism and the multi-scale feature fusion strategy.
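The gains quoted above can be recomputed from the four mAP@0.5 values. The short sketch below reports both the absolute difference in percentage points and the relative change with respect to the baseline, since the two are easy to confuse: for the CBAM variant they are 1.63 points and about 2.20%, respectively. The dictionary and function names are illustrative.

```python
# mAP@0.5 values (%) as reported for Figure 9.
MAP50 = {
    "Mask R-CNN": 74.09,
    "Mask R-CNN-CBAM": 75.72,
    "Mask R-CNN-BiFPN": 78.63,
    "Mask R-CNN-CBAM-BiFPN": 82.45,
}


def gains(value: float, baseline: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in %)."""
    absolute = value - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 2)


baseline = MAP50["Mask R-CNN"]
for name, value in MAP50.items():
    points, percent = gains(value, baseline)
    print(f"{name}: +{points} points, +{percent}% relative")
```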

4.2. Recognition Result Analysis

To validate the performance of the improved Mask R-CNN model in rock fracture identification, 100 images were selected from the original rock fracture dataset for evaluation. Figure 10 presents the identification results of rock fracture images obtained by different model methods. Comparative analysis shows that in simple fracture detection scenarios (Sample 1), all compared models demonstrate satisfactory baseline performance, with no significant difference observed compared to the improved model proposed in this study. This phenomenon confirms that existing methods have achieved stable capabilities in extracting explicit fracture features. When targets possess clear edge characteristics with minimal background interference, conventional prediction models can successfully accomplish the identification task.

When dealing with low-contrast linear fractures (Sample 2), where the average pixel value of fractures is lower than the background grayscale, the average missed detection rate of the five compared models reaches 42.3%. In contrast, the improved model exhibits outstanding performance in maintaining continuous detection along fracture segments, further reducing breakage rates compared to the best baseline model. This improvement primarily stems from the channel attention mechanism in CBAM, which adaptively recalibrates feature channel weights to enhance the feature representation of low-contrast fractures, while BiFPN’s multi-scale bidirectional feature fusion capability ensures the preservation of fracture continuity.

For detecting fracture intersection areas and tip features (Sample 3), experimental results demonstrate significant advantages of the improved model. In test samples containing intersecting fractures, the improved model achieves sub-pixel level accuracy in locating fracture tips. This benefits from BiFPN’s efficient multi-scale feature fusion mechanism, which simultaneously leverages semantic information from deep networks and detailed information from shallow networks. Combined with CBAM’s attention focus on key regions, the model achieves precise analysis of complex fracture intersections, providing substantial value for engineering applications such as rock stability analysis.

In images with strong background interference and blurring (Sample 4), where the grayscale difference between fractures and background is less than 10, compared models show varying degrees of fracture breakage in their identification results. The improved model, through dynamic weighting of suspected regions, demonstrates more accurate semantic understanding of boundaries. The CBAM attention mechanism plays a crucial role here, with its spatial attention module suppressing background interference while the channel attention module enhances feature channels related to fractures, enabling the model to maintain detection performance under extremely low-contrast conditions.

Finally, in scenarios involving fine fractures and complex intersections (Sample 5), the improved model demonstrates comprehensive performance enhancement for detecting micro-fractures with small widths. BiFPN’s learnable weighting parameters automatically balance the contribution of different scale features to micro-fracture detection, while CBAM enhances the feature representation of micro-fractures through its dual attention mechanism. The synergistic effect of both components significantly improves the model’s sensitivity to fine fractures.
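To make the role of the learnable weights concrete, the following sketch implements the fast normalized fusion rule used by BiFPN in the EfficientDet paper: each input feature receives a non-negative weight, and the weights are normalized by their sum plus a small constant. It is a pure-Python illustration on flattened feature vectors that are assumed to have already been resized to a common resolution; the function name and the toy values are hypothetical, and the authors' implementation may differ in detail.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-sized feature vectors with learnable, non-negative weights.

    output = sum_i (w_i / (eps + sum_j w_j)) * feature_i, with w_i = ReLU(raw_i).
    Returns the fused vector and the normalized weights.
    """
    if len(features) != len(weights):
        raise ValueError("one weight is required per input feature")
    relu_w = [max(0.0, w) for w in weights]        # keep weights non-negative
    total = sum(relu_w) + eps                      # eps avoids division by zero
    norm_w = [w / total for w in relu_w]
    length = len(features[0])
    fused = [
        sum(w * feat[k] for w, feat in zip(norm_w, features))
        for k in range(length)
    ]
    return fused, norm_w


# Toy example: a shallow (detail-rich) and a deep (semantic) feature vector.
shallow = [1.0, 2.0, 0.0, 4.0]
deep = [3.0, 0.0, 1.0, 2.0]
fused, norm_w = fast_normalized_fusion([shallow, deep], weights=[3.0, 1.0])
print(norm_w)   # the shallow branch contributes about three times as much
print(fused)
```

If training pushes the weight of the shallow branch up, fine detail from high-resolution levels dominates the fused feature, which is the mechanism the text credits for improved micro-fracture sensitivity.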

These experimental results collectively demonstrate that the proposed improvements maintain performance in simple scenarios while enhancing the robustness and accuracy of complex fracture detection. From the perspective of feature representation, the CBAM attention mechanism improves the model’s focus on critical fracture regions by recalibrating feature responses in both channel and spatial dimensions. Meanwhile, BiFPN addresses the information loss issue in traditional FPN during feature transmission through efficient multi-scale feature fusion. The combination of these two components not only enhances the model’s sensitivity to fracture features at the feature extraction level but also optimizes the representation capability of multi-scale features at the feature fusion level, providing strong technical support for fracture identification in complex scenarios.
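The channel and spatial recalibration mentioned above can be sketched in a few lines of NumPy. The example follows the general CBAM layout (channel attention first, then spatial attention with a 7 × 7 kernel over channel-pooled maps), but it uses random weights, a toy feature map, and a reduction ratio of 4; all of these, as well as the function names, are assumptions for illustration rather than the trained module from this study.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(x, w1, w2):
    """x: (C, H, W). A shared two-layer MLP scores avg- and max-pooled channels."""
    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)       # ReLU between the two layers
    avg = x.mean(axis=(1, 2))                     # (C,)
    mx = x.max(axis=(1, 2))                       # (C,)
    return sigmoid(mlp(avg) + mlp(mx))            # (C,) weights in (0, 1)


def spatial_attention(x, kernel):
    """kernel: (2, k, k), convolved over the [avg; max] channel-pooled maps."""
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])        # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    h, w = x.shape[1:]
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)                                        # (H, W) in (0, 1)


def cbam(x, w1, w2, kernel):
    """Apply channel attention, then spatial attention, to a feature map."""
    x = x * channel_attention(x, w1, w2)[:, None, None]
    return x * spatial_attention(x, kernel)[None, :, :]


rng = np.random.default_rng(0)
channels, reduction = 8, 4                        # toy sizes (assumed)
x = rng.normal(size=(channels, 6, 6))
w1 = rng.normal(size=(channels // reduction, channels))
w2 = rng.normal(size=(channels, channels // reduction))
kernel = rng.normal(size=(2, 7, 7)) * 0.1
out = cbam(x, w1, w2, kernel)
print(out.shape)
```

Both attention maps lie in (0, 1), so the module can only attenuate responses relative to the input; in a trained network, background locations and uninformative channels would receive the smaller weights.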

4.3. Model Performance Analysis

Table 3 presents the performance comparison of different models on the rock fracture detection task. The baseline Mask R-CNN model achieves 38.1% in mAP@[0.5:0.95], 73.5% in precision, 67.6% in recall, and 0.704 in F1-score. It outperforms YOLOV8 (mAP@[0.5:0.95]: 35.6%, P: 70.6%, R: 62.7%, F1: 0.664) on every indicator and DeepLabV3 (mAP@[0.5:0.95]: 38.5%, P: 72.3%, R: 65.8%, F1: 0.689) in precision, recall, and F1-score, although its mAP@[0.5:0.95] is 0.4 percentage points lower than that of DeepLabV3. Overall, these results support Mask R-CNN as a competitive baseline for fracture detection.

After integrating the CBAM attention mechanism into Mask R-CNN, the model shows significant improvements with mAP@[0.5:0.95] increased to 39.5%, precision to 75.9%, recall to 70.4%, and F1-score to 0.731, validating the effectiveness of attention mechanisms in enhancing fracture feature representation. The model equipped with BiFPN feature pyramid network also demonstrates performance gains, achieving 42.7% mAP@[0.5:0.95], 78.8% precision, 75.5% recall, and 0.771 F1-score. Ultimately, the improved model incorporating both CBAM and BiFPN achieves the best performance with 45.9% mAP@[0.5:0.95], 82.6% precision, 80.3% recall, and 0.814 F1-score, representing improvements of 7.8 percentage points, 9.1 percentage points, 12.7 percentage points, and 0.11, respectively, compared to the baseline Mask R-CNN.

The confidence comparison results shown in Figure 11 further verify the performance differences among models. YOLO series models exhibit relatively poor confidence performance with dispersed distributions and lower average values, consistent with their weaker quantitative metrics. In contrast, the integrated model based on Mask R-CNN demonstrates significant improvement, showing not only substantially higher prediction confidence than the baseline model but also more concentrated and stable distributions, indicating more accurate and reliable rock fracture identification.

Experimental results confirm that the synergistic design of attention mechanisms and multi-scale feature fusion effectively enhances model performance, providing an effective solution for accurate rock fracture identification in complex geological environments. Both quantitative metrics and confidence analysis demonstrate that the proposed improved model maintains high precision while achieving more reliable detection stability.

5. Conclusions

This paper proposes an improved Mask R-CNN model that integrates the CBAM attention mechanism and the BiFPN feature fusion module and conducts a systematic comparative experiment on a self-built rock fracture image dataset. The main conclusions are as follows:

(1): The improved model significantly outperforms the baseline Mask R-CNN model in the key fracture recognition indicators of precision, recall, and mAP. The precision (P) of the improved model reaches 82.6% and the recall (R) reaches 80.3%, which are 9.1 and 12.7 percentage points higher than the baseline model, respectively; the mAP@0.5 indicator is improved by 8.36 percentage points.
(2): The CBAM module, through the combined effects of channel attention and spatial attention, can adaptively enhance the discriminative features related to cracks and suppress background noise, thereby improving the model’s recognition ability under low contrast and complex lighting conditions; the BiFPN module, through bidirectional cross-layer feature fusion and learnable weight distribution mechanism, achieves efficient integration of multi-scale features, ensuring that the model can maintain the detailed features of microcracks while taking into account the global coherence of large-scale cracks. The synergistic effect of the two improves the stability and robustness of the model in different complex scenarios.
(3): The CBAM-BiFPN-Mask R-CNN model proposed in this paper shows higher accuracy and robustness in the detection of complex backgrounds, low contrast and fine cracks. It can more quickly and accurately determine the presence and location of cracks in images.
(4): Although the current self-constructed dataset has considerable scale and diversity, it can be expanded further. Since the model was trained solely on this custom dataset, its generalizability (transferability) to other rock types or extreme environments requires further validation. Future research should focus on constructing a larger and more diverse cross-scenario rock fracture dataset to systematically evaluate and enhance the model’s generalization capability and robustness. Additionally, employing semi-supervised or self-supervised learning strategies would help leverage extensive unlabeled field image data, thereby reducing reliance on large-scale annotated datasets.

Author Contributions

Y.H. and Y.Y.: Writing—original draft preparation and investigation, Conceptualization, methodology, project administration, formal analysis. N.D.: methodology and supervision. Y.H. and Q.Z.: Conceptualization, methodology, project administration, resources, supervision, writing—review and editing, and funding acquisition. F.Y.: writing—original draft preparation, Data curation and visualization. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Open Research Fund Program of the State Key Laboratory of Hydroscience and Engineering (sk1hse-2024-D-02).

Data Availability Statement

The datasets generated and analysed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figure 1. Annotation of fractures in rock masses: (a) Rock mass crack example; (b) Rock fracture annotation mask.

Figure 2. Results of individual enhancement methods. (a) Original image. (b) Gaussian blur. (c) Gaussian noise. (d) Brightness.

Figure 3. Results of combined enhancement methods. (a) Brightness and contrast. (b) Brightness and Gaussian noise. (c) Gaussian blur and median blur. (d) Mean blur and saturation.

Figure 4. Performance Comparison of Enhancement Methods: (a) Individual image enhancement methods; (b) Combined enhancement methods.

Figure 5. CBAM-BiFPN-Mask R-CNN structure.

Figure 6. CBAM attention mechanism.

Figure 7. The network structure of FPN and BiFPN: (a) FPN; (b) BiFPN.

Figure 8. Loss function and learning rate convergence curve: (a) Mask R-CNN; (b) Mask R-CNN-CBAM; (c) Mask R-CNN-BiFPN; (d) Mask R-CNN-CBAM-BiFPN.

Figure 9. Training mAP Progress.

Figure 10. Recognition results for the instance segmentation of rock fractures.

Figure 11. Scatter box plots showing the performance of different models: (a) Precision. (b) Recall.

Table 1. Image enhancement methods.

Enhancement Type	Name	Parameter Settings
Blurring	Gaussian blur	blur_limit = (3,7)
	Median blur	blur_limit = 3
	Mean blur	blur_limit = 3
Noise	Gaussian noise	var_limit = (10.0,50.0)
Color adjustment	Saturation	saturation = 0.5
	Contrast	contrast = 0.5
	Brightness	brightness = 0.3
Combined	Random 2-transform Combination	Random selection of 2 methods
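The "random 2-transform combination" in the last row of Table 1 can be sketched as follows. The example draws two distinct transforms at random and applies them in sequence, using plain NumPy stand-ins for four of the listed methods. The parameter interpretations are assumptions: brightness = 0.3 and contrast = 0.5 are read as maximum relative changes, and var_limit as a noise variance range on a 0 to 255 intensity scale. The function names are hypothetical, and the authors' augmentation library is not identified in this section.

```python
import numpy as np


def brightness(img, rng, limit=0.3):
    """Scale intensities by a random factor in [1 - limit, 1 + limit]."""
    return np.clip(img * (1.0 + rng.uniform(-limit, limit)), 0, 255)


def contrast(img, rng, limit=0.5):
    """Stretch or compress intensities around the image mean."""
    mean = img.mean()
    return np.clip((img - mean) * (1.0 + rng.uniform(-limit, limit)) + mean, 0, 255)


def gaussian_noise(img, rng, var_limit=(10.0, 50.0)):
    """Add zero-mean Gaussian noise with variance drawn from var_limit."""
    sigma = np.sqrt(rng.uniform(*var_limit))
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0, 255)


def mean_blur(img, rng, ksize=3):
    """Box filter of size ksize x ksize with edge replication."""
    pad = ksize // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(ksize):
        for dx in range(ksize):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / ksize ** 2


TRANSFORMS = [brightness, contrast, gaussian_noise, mean_blur]


def random_pair_augment(img, rng):
    """Apply two distinct, randomly chosen transforms to an H x W x C image."""
    chosen = rng.choice(len(TRANSFORMS), size=2, replace=False)
    out = img.astype(float)
    names = []
    for index in chosen:
        out = TRANSFORMS[index](out, rng)
        names.append(TRANSFORMS[index].__name__)
    return out.astype(np.uint8), names


rng = np.random.default_rng(42)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # stand-in image
augmented, applied = random_pair_augment(image, rng)
print(applied, augmented.shape, augmented.dtype)
```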

Table 2. The experimental hardware.

Configuration	Parameter
CPU	Intel Core i9-13900HX
GPU	NVIDIA GeForce RTX 4060 Laptop GPU
Development environment	Windows 10
Memory	32 GB
Hard disk	1 TB

Table 3. Model performance evaluation metrics.

Methods	mAP@[0.5:0.95] (%)	Precision (%)	Recall (%)	F1 Score
DeepLabV3	38.5	72.3	65.8	0.689
YOLOV8	35.6	70.6	62.7	0.664
Mask R-CNN	38.1	73.5	67.6	0.704
Mask R-CNN-CBAM	39.5	75.9	70.4	0.731
Mask R-CNN-BiFPN	42.7	78.8	75.5	0.771
Mask R-CNN-CBAM-BiFPN	45.9	82.6	80.3	0.814

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
