4.1. Datasets
Rissbilder Dataset [41]. The Rissbilder dataset contains 2736 images of different types of cracks collected from structures such as walls and bridges, along with their corresponding label images. The original resolution of the images in the dataset is . The proportion of crack pixels in the images is approximately 2.70%, and some of the images exhibit distortion. In the experiments, we randomly divided the dataset into training, validation, and test sets at a 6:1:3 ratio.
CrackTree200 Dataset [42]. The CrackTree200 dataset contains 175 images of cracks collected from pavement and building surfaces, along with manually annotated label images. The original resolution of the images is . The proportion of crack pixels in the images is relatively low, approximately 0.31%. Additionally, the images in this dataset contain interference factors such as shadows, occlusions, low contrast, and noise. For the division into training, validation, and test sets, we again followed a 6:1:3 ratio.
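As a reference for reproducing the 6:1:3 split used for the Rissbilder and CrackTree200 datasets, the following is a minimal sketch; the helper name and random seed are illustrative assumptions rather than part of our released code.

```python
import random

def split_dataset(image_paths, seed=42):
    """Randomly split a list of image paths into train/val/test subsets at a 6:1:3 ratio.
    The function name and seed are illustrative; only the 6:1:3 ratio comes from the paper."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train, n_val = int(0.6 * n), int(0.1 * n)
    return paths[:n_train], paths[n_train:n_train + n_val], paths[n_train + n_val:]
```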
GAPS384 Dataset [43]. The GAPS384 dataset contains 383 crack images collected during the summer of 2015 under dry and warm conditions. Because of the constraints placed on the collection environment, the dataset has minimal image noise caused by harsh weather conditions such as rain glare and mud obstruction. However, it remains challenging due to random thickening distortions of the cracks. The original resolution of the images is . Since the proportion of crack pixels in the images is only 1.21%, it is classified as a low-crack-ratio dataset.
4.3. Implementation Details
All experiments were conducted on an Ubuntu 22.04 system. The hardware included an Intel i5-12600KF CPU, an NVIDIA RTX 3090 GPU, and a 1200 W power supply; the software environment consisted of Anaconda 22.3.1, PyTorch 2.0, and PyCharm 2023.1 (Community Edition). Before training, we applied data augmentation strategies to improve the model's performance, including random flipping, random cropping, and random deformation. For the hyperparameters, we set the batch size to 4, resized all images to a resolution of , initialized the learning rate to 0.0001, set the total number of training iterations to 80,000, and used the Adam optimizer to promote convergence. We set the regularization parameter in the loss function to 0.4 and the fusion weight of the high-frequency branch in the encoder to 0.6.
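The training procedure described above can be summarized by the following PyTorch sketch. The function signature, device handling, and helper names are assumptions for illustration; the batch size of 4 is applied when constructing the DataLoader, and `criterion` denotes the combined loss discussed in Section 4.5.

```python
import torch
from torch.utils.data import DataLoader

def train(model: torch.nn.Module, train_loader: DataLoader, criterion, max_iters: int = 80_000):
    """Iteration-based training with the settings above: Adam optimizer, initial learning
    rate 1e-4, and 80,000 iterations. `train_loader` is assumed to be built with batch_size=4
    and the augmentations listed above (random flipping, cropping, and deformation)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    it = 0
    while it < max_iters:
        for images, masks in train_loader:
            images, masks = images.to(device), masks.to(device)
            logits = model(images)              # per-pixel crack predictions
            loss = criterion(logits, masks)     # weighted Dice + cross-entropy loss (Section 4.5)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            it += 1
            if it >= max_iters:
                return model
    return model
```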
4.4. Comparison Experiments
To evaluate the segmentation performance of the proposed HFR-Mamba, we compared it with eight state-of-the-art competitors, namely U-Net [46], DeepLabv3 [47], Vision Transformer [33], Swin Transformer [31], CrackSegNet [48], DECSNet [49], RHACrackNet [50], and CarNet [51], on the public Rissbilder, CrackTree200, and GAPS384 datasets. Among these competitors, U-Net, DeepLabv3, Vision Transformer, and Swin Transformer are commonly used benchmarks for general semantic segmentation, while CrackSegNet, DECSNet, RHACrackNet, and CarNet are state-of-the-art methods specifically designed for crack segmentation. To examine the effect of pretraining, we report the results of these methods both with and without loading pretrained weights from the ImageNet-1K dataset [52]. Note that our proposed HFR-Mamba was not pretrained.
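For the pretrained variants of the baselines, ImageNet-1K weights are loaded into the backbone before fine-tuning on the crack datasets. A minimal sketch of this general procedure is shown below for a torchvision backbone; the checkpoint path is hypothetical, and each baseline actually uses its own code base and checkpoints.

```python
import torch
import torchvision

# Load ImageNet-1K pretrained weights into a ResNet-50 backbone (illustrative only).
backbone = torchvision.models.resnet50(
    weights=torchvision.models.ResNet50_Weights.IMAGENET1K_V1
)

# Weights exported from another repository can be loaded non-strictly, so that layers
# missing from the checkpoint (e.g., a segmentation head) keep their random initialization.
state_dict = torch.load("imagenet1k_backbone.pth", map_location="cpu")  # hypothetical path
backbone.load_state_dict(state_dict, strict=False)
```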
The performances of HFR-Mamba and other competitors on the Rissbilder dataset are recorded in
Table 1. We observed that our proposed HFR-Mamba achieves state-of-the-art performance on all metrics. Compared with CarNet, which ranked second and is specifically designed for crack segmentation, HFR-Mamba outperforms it by 2.71%, 2.25%, 2.28%, 0.19%, and 2.04% in the mIoU, DSC, Recall, Accuracy, and Precision metrics, respectively. Additionally, we found that loading pretrained weights from the ImageNet-1K dataset significantly improves segmentation performance, suggesting that pretraining on large-scale datasets can enhance a model's accuracy in segmenting crack structures. To provide a more intuitive comparison, we visualized the segmentation results of HFR-Mamba and the top-performing models, as shown in
Figure 4. From this comparison, we observed that HFR-Mamba captures the fine-grained structures of cracks more accurately. In contrast, other methods, such as Swin Transformer, perform suboptimally in segmenting curved regions, suggesting that general segmentation models require specialized techniques when dealing with cracks. Overall, the proposed HFR-Mamba can accurately segment crack structures on the Rissbilder dataset.
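For clarity, the metrics reported in these tables can be computed from binary prediction and ground-truth masks as in the following sketch; this is a simplified illustration, and the exact evaluation protocol (e.g., averaging over images) may differ.

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> dict:
    """IoU, DSC, Recall, Accuracy, and Precision for a pair of binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    return {
        "IoU": tp / (tp + fp + fn + eps),
        "DSC": 2 * tp / (2 * tp + fp + fn + eps),
        "Recall": tp / (tp + fn + eps),
        "Accuracy": (tp + tn) / (tp + tn + fp + fn + eps),
        "Precision": tp / (tp + fp + eps),
    }
```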
To evaluate the performance of the proposed method on the CrackTree200 dataset, we recorded the scores of all models on the IoU, DSC, Recall, Accuracy, and Precision metrics. The CrackTree200 dataset consists primarily of tree-like crack images, which are challenging because they require models to capture global dependencies. As shown in Table 2, our HFR-Mamba achieved an IoU score of 79.34%, a DSC score of 81.02%, a Recall score of 87.30%, an Accuracy score of 99.83%, and a Precision score of 78.94%, outperforming all other models and achieving state-of-the-art performance across all metrics. More broadly, the Transformer-based models outperform the CNN-based ones, supporting our hypothesis that extracting contextual crack features requires long-range dependency modeling. For a deeper comparison of segmentation quality, we visualized the results of the proposed HFR-Mamba and the top-performing segmentation models. As shown in
Figure 5, the proposed HFR-Mamba accurately captures the tree-like crack structures, demonstrating its ability to model the relationships between crack structures and indicating its strong contextual modeling capability. In contrast, CarNet, which is specifically designed for crack segmentation, focuses solely on spatial feature extraction and neglects the advantages of frequency domain representations in capturing textures. As a result, it performs suboptimally in segmenting fine-grained features.
The segmentation performance of all models on the GAPS384 dataset is analyzed both qualitatively and quantitatively in
Table 3 and
Figure 6. Compared with the other methods, the proposed HFR-Mamba achieves the highest scores across all metrics. Specifically, HFR-Mamba obtained 71.27%, 81.88%, 84.31%, 99.52%, and 77.58% in mIoU, DSC, Recall, Accuracy, and Precision, respectively. Compared to the dedicated crack segmentation model CarNet, HFR-Mamba outperforms it by 2.29%, 2.62%, 3.51%, 0.18%, and 1.76% in mIoU, DSC, Recall, Accuracy, and Precision, respectively. Furthermore, HFR-Mamba surpasses the general segmentation method Swin Transformer by 3.16%, 4.41%, 4.40%, 0.32%, and 3.44% on the same metrics. These comparisons against models designed for different purposes effectively validate the superiority of the proposed method.
Figure 6 visualizes the segmentation results of the proposed HFR-Mamba and the top-performing models. By comparison, we observed that HFR-Mamba exhibits a clear advantage in crack segmentation. It is capable of recognizing slender curves while preserving key fine-grained features. Therefore, the proposed method shows great potential for application in crack detection across other scenarios.
To provide a comprehensive comparison of all methods, we calculated their model parameters, FLOPs, FPS, and other efficiency-related metrics. As shown in
Table 4, the proposed HFR-Mamba has 22.1 M parameters. Compared to Transformer-based models such as ViT, the Mamba-based approach shows a clear advantage. Although the parameter count of HFR-Mamba is not the lowest, our aim in this work is more precise segmentation rather than minimal model size. Additionally, HFR-Mamba requires 21.9 G FLOPs, indicating high computational efficiency. Regarding the FPS metric, the proposed method is only slightly slower than the lightweight CarNet, suggesting that HFR-Mamba offers good real-time performance. Thus, the proposed HFR-Mamba achieves more accurate segmentation with low computational overhead.
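The efficiency metrics in Table 4 can be estimated as in the sketch below: parameters are counted directly, and FPS is measured with repeated forward passes. The input size and number of runs are assumptions, and FLOPs are typically obtained with a profiling tool such as thop or fvcore (not shown).

```python
import time
import torch

def count_parameters_m(model: torch.nn.Module) -> float:
    """Trainable parameters in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def measure_fps(model: torch.nn.Module, input_size=(1, 3, 512, 512), runs: int = 100) -> float:
    """Rough FPS measurement; the input resolution and run count are illustrative assumptions."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    for _ in range(10):                       # warm-up iterations
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return runs / (time.time() - start)
```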
4.5. Ablation Study
In this subsection, we conducted ablation experiments on the individual components of the proposed HFR-Mamba to evaluate their effectiveness. Given the complexity of the crack structures in the CrackTree200 dataset, all ablation experiments were performed on this dataset. First, we removed the high-frequency refinement branch from the encoder and used the remaining part as the backbone. As shown in
Table 5, by directly upsampling the segmentation results from the backbone, we achieved an IoU score of 76.74%. After incorporating the high-frequency features into the encoder, we observed a 1.08% increase in the IoU score. This validates the necessity of introducing high-frequency information into Mamba for crack segmentation. Next, we added the decoder. By comparing with the direct upsampling method, we found that the progressive decoder significantly improves crack segmentation, achieving a 0.54% increase in IoU. Additionally, we verified the contribution of the proposed SPA module. By comparing the performance of HFR-Mamba before and after adding the module, we found that the SPA module increased the IoU score by 0.66%. To visually assess the high-frequency branch, we performed a heatmap visualization of HFR-Mamba before and after adding the branch. As shown in
Figure 7, we observed that the high-frequency branch enables HFR-Mamba to capture changes in the fine-grained branches of cracks and improves the global continuity. Thus, these results demonstrate that each component of HFR-Mamba plays a crucial role in crack segmentation tasks.
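The Figure 7-style heatmaps can be produced by overlaying an upsampled feature-activation map on the input image, as in the sketch below; the channel-averaging and normalization choices are our own illustrative assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn.functional as F

def overlay_heatmap(image: np.ndarray, feat: torch.Tensor, alpha: float = 0.5) -> None:
    """Overlay a (C, h, w) feature map on an (H, W, 3) image as a heatmap."""
    heat = feat.abs().mean(dim=0, keepdim=True)[None]              # (1, 1, h, w) activation magnitude
    heat = F.interpolate(heat, size=image.shape[:2], mode="bilinear", align_corners=False)
    heat = heat[0, 0].detach().cpu().numpy()
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)  # normalize to [0, 1]
    plt.imshow(image)
    plt.imshow(heat, cmap="jet", alpha=alpha)
    plt.axis("off")
    plt.show()
```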
We also conducted an ablation study on the loss-weighting hyperparameter. As shown in Table 5, adding extra supervision to the decoder's layer-wise features led to a further improvement. We then performed ablation experiments on the weight of the supervisory branch in the decoder. As shown in Figure 8, HFR-Mamba performs best across the three public datasets when this hyperparameter is set to 0.4. This is because the resulting higher weight on the Dice loss encourages the model to focus on the overall integrity and consistency of the target region, which works better for imbalanced data, while the lower weight on the cross-entropy loss ensures that the model maintains accuracy in boundary details without overfitting, thus improving the overall region segmentation. This setting enhances the overall performance of crack segmentation, especially in complex backgrounds.
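A minimal sketch of such a weighted Dice plus cross-entropy objective is given below. Assigning the 0.4 weight to the cross-entropy term and the remaining 0.6 to the Dice term is our reading of the discussion above; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits: torch.Tensor, target: torch.Tensor,
                  lam: float = 0.4, eps: float = 1e-6) -> torch.Tensor:
    """Weighted sum of binary cross-entropy (weight `lam`) and soft Dice loss (weight 1 - lam)."""
    target = target.float()
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum()
    dice = 1.0 - (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)
    bce = F.binary_cross_entropy_with_logits(logits, target)
    return lam * bce + (1.0 - lam) * dice
```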
To explore the optimal fusion strategy of spatial and frequency domain features, we compared three methods: pixel-wise addition, tensor multiplication, and weighted fusion. As shown in
Table 6, we observed that when the two feature sets are combined by element-wise multiplication, HFR-Mamba achieves IoU, DSC, Recall, Accuracy, and Precision scores of 78.82%, 80.07%, 86.45%, 99.71%, and 78.13%, respectively. Adding the two feature sets instead yields a slight improvement in segmentation accuracy. Subsequently, we employed a weighted fusion strategy to combine the two feature sets and compared three settings: emphasizing the frequency domain while weakening the spatial features (EFWS), equalizing the frequency and spatial features (EFS), and weakening the frequency domain while emphasizing the spatial features (WFES). As shown in Figure 9, under the WFES setting, the proposed method shows relatively low performance on the three public datasets. This indicates that amplifying spatial features while reducing the importance of frequency domain features is not suitable for crack segmentation, because the spatial domain may contain background and noise, and amplifying these features can interfere with the semantics carried by the high-frequency features. In contrast, when the weight of the high-frequency branch is increased, the proposed method performs better on the three public datasets, which again confirms the necessity of introducing high-frequency features. We also observed that a moderately increased high-frequency weight outperforms settings in which the frequency features are amplified excessively while the spatial features are compressed. Since we only introduce high-frequency features while discarding low-frequency ones, over-weighting the frequency branch inevitably loses some background-related semantics, thereby impairing the model's ability to recognize the contextual features of the cracks. Therefore, with the high-frequency weight set to 0.6, the encoder effectively fuses spatial features and high-frequency semantics, improving the segmentation performance.
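The weighted fusion studied here can be expressed as a convex combination of the two feature maps, as in the sketch below; the complementary weighting of the spatial branch is an assumption, with the weight of the high-frequency branch set to 0.6 in our final configuration.

```python
import torch

def weighted_fusion(freq_feat: torch.Tensor, spatial_feat: torch.Tensor,
                    alpha: float = 0.6) -> torch.Tensor:
    """Fuse high-frequency and spatial feature maps of identical shape.
    alpha > 0.5 corresponds to EFWS, alpha = 0.5 to EFS, and alpha < 0.5 to WFES."""
    return alpha * freq_feat + (1.0 - alpha) * spatial_feat
```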
To evaluate the performance of the SPA module, we conducted ablation experiments on it. Specifically, we compared it with three commonly used attention modules on the CrackTree200 dataset: the Squeeze-and-Excitation (SE) attention module [53], the Strip Pooling module [54], and the self-attention module [55]. As shown in
Table 7, replacing the SPA module with the SE module decreased IoU, DSC, Recall, Accuracy, and Precision by 1.13%, 0.68%, 0.63%, 0.14%, and 0.70%, respectively. This demonstrates that simply increasing the model's attention to channel features is not optimal for crack segmentation. Next, we compared the two spatial attention methods and found that the Strip Pooling and self-attention modules improve segmentation performance considerably over the channel attention module. Finally, we compared these methods with the proposed SPA module, which achieved the best performance across all metrics. This indicates that incorporating both channel and spatial attention, as in HFR-Mamba, is well suited to crack segmentation tasks.
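For reference, the channel-attention baseline in this comparison follows the Squeeze-and-Excitation design of [53]; a compact sketch is given below, where the reduction ratio of 16 is the commonly used default rather than a value taken from our experiments.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention [53]; reduction ratio is the usual default."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)  # per-channel attention weights
        return x * w
```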
Additionally, we conducted ablation experiments on the resolution of the input images. As shown in
Table 8, we resized the original images to four different resolutions. From this comparison, we found that resizing the images to small resolutions significantly degrades the model's crack segmentation accuracy. This suggests that excessive downsampling disrupts the contextual relationships between crack structures, making precise segmentation more difficult. Due to memory constraints, we did not experiment with higher resolutions. Therefore, among all tested resolutions, the largest one best exploits the potential of HFR-Mamba for crack segmentation.